Inference quality #997
-
My guess is that our sampling is just bad. I saw a Colab notebook for running inference with the original LLaMA, and it uses much more advanced sampling techniques, such as Tail Free Sampling.
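For illustration, here is a rough sketch of the idea behind Tail Free Sampling, assuming the logits have already been turned into probabilities (this is not the notebook's code, just how the technique works conceptually):

```python
import numpy as np

def tail_free_sample(probs, z=0.95, rng=np.random.default_rng()):
    """Rough sketch of Tail Free Sampling: drop the flat low-probability 'tail'
    of the sorted distribution and sample only from the head that still bends."""
    order = np.argsort(probs)[::-1]              # token ids, most likely first
    sorted_probs = probs[order]

    # Curvature of the sorted probability curve: absolute second differences,
    # normalized to sum to 1.
    curvature = np.abs(np.diff(sorted_probs, n=2))
    curvature /= curvature.sum()

    # Keep tokens until the cumulative curvature mass reaches z.
    keep = int(np.searchsorted(np.cumsum(curvature), z)) + 1
    head = sorted_probs[:keep] / sorted_probs[:keep].sum()   # renormalize the head

    return int(order[rng.choice(keep, p=head)])
```

Compared to plain top_k/top_p, the cutoff adapts to how sharply the distribution falls off rather than using a fixed token count or probability mass.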
-
So essentially, if you remove all randomness from the sampling and always pick the most likely token (minus the repetition penalty), it produces the highest-quality text, which is pretty much what you would expect. But then the generation is always identical, which defeats the purpose of sampling in the first place. If we assume that we always want some top_k > 1 and top_p > 0, then it would still be interesting to find the optimal parameters for the repetition penalty.
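To make the trade-off concrete, here is a minimal sketch of how top_k/top_p filtering and a repetition penalty typically interact (parameter names mirror the CLI flags, but this is only an illustration, not llama.cpp's actual sampler):

```python
import numpy as np

def sample_token(logits, recent_tokens, top_k=40, top_p=0.95,
                 temperature=0.8, repeat_penalty=1.1, rng=np.random.default_rng()):
    """Illustrative sampler: penalize recently used tokens, filter with
    top_k/top_p, then draw. With top_k = 1 it degenerates to greedy decoding."""
    logits = logits.astype(float)

    # Repetition penalty: push recently seen tokens down in the distribution.
    for tok in set(recent_tokens):
        logits[tok] = logits[tok] / repeat_penalty if logits[tok] > 0 else logits[tok] * repeat_penalty

    # Temperature + softmax.
    probs = np.exp((logits - logits.max()) / max(temperature, 1e-6))
    probs /= probs.sum()

    # Top-k: keep only the k most likely tokens.
    order = np.argsort(probs)[::-1][:top_k]
    head = probs[order]

    # Top-p (nucleus): keep the smallest prefix of that head reaching mass top_p.
    keep = min(int(np.searchsorted(np.cumsum(head), top_p)) + 1, len(head))
    head = head[:keep] / head[:keep].sum()

    return int(order[rng.choice(keep, p=head)])
```

With top_k = 1 the random draw never matters, so every run is identical; raising top_k and top_p trades a bit of per-token quality for variety, which is exactly the trade-off described above.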
-
The primary challenge in text-generation inference is the quality of the generated text: the more errors it contains, the less accurate the response is, and poor sampling choices directly increase the number of errors.
I have developed a script that tries to optimize the sampling parameters top_k, top_p, repeat_last_n, repeat_penalty, and temperature for the LLaMA 7B model. My "objective" metric is based on the BERTScore recall between the model's prediction and the ground-truth answer (GTA). The objective is minimized with an optimization algorithm such as a genetic algorithm or Bayesian optimization. To that end, I calculate the objective score as
score(prediction, gta) = -BERTScore_R(prediction, gta).
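As a minimal sketch of that objective (assuming the bert_score package; run_llama is a hypothetical helper that calls the llama.cpp binary with the given sampling parameters, and the real logic lives in the script attached below):

```python
from bert_score import score  # pip install bert-score

def objective(params, prompt, gta):
    """Negative BERTScore recall between the model's continuation and the
    ground-truth answer; lower is better, so any minimizer (Bayesian
    optimization, a genetic algorithm, ...) can consume it directly."""
    prediction = run_llama(prompt,                      # hypothetical wrapper around the llama.cpp binary
                           top_k=params["top_k"],
                           top_p=params["top_p"],
                           temp=params["temperature"],
                           repeat_last_n=params["repeat_last_n"],
                           repeat_penalty=params["repeat_penalty"])

    # bert_score.score returns (precision, recall, F1), one value per candidate/reference pair.
    _, recall, _ = score([prediction], [gta], lang="ru")  # the benchmark prompt is in Russian
    return -recall.item()
```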
My prompt is straightforward: a concise summary of a conversation, along with a few-shot sample of the conversation that LLaMA should logically continue.
For the Ground Truth Answer, I use a reworded summary that matches the conversation.
The prompt I use as a benchmark is written entirely in Russian, and even with different temperature settings the output contains lots of grammar errors and non-existent words.
Why did I choose BERTScore as my objective? I am not sure it is the best choice; it could just as well be some harmonic mean of BERTScore, grammar quality, and a repetition measure. I found that repetitions, when they occur, do not lower BERTScore_P but do lower BERTScore_R. I chose BERTScore recall because it reflects "how many relevant items are retrieved", whereas F1 is the harmonic mean of precision and recall; the conversation does not have to be precise, so I allow the model to be more creative.
After spending time identifying the best combination of parameters, both by subjectively ranking answers and by running Bayesian optimization, I found that the optimization tends to favor top_k = 1 and top_p = 0, which seems peculiar. As for the other parameters, with ctx_size = 1024, n_predict = 512, and ignore_eos = True, the optimizer prefers repeat_last_n = 359, repeat_penalty = 1.1876426654180257, and temperature = 0.24598246046698435. The temperature appears to matter less because of the low values chosen for top_k and top_p, but in my view it should be around 0.4.
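A Bayesian-optimization loop over these parameters can be set up with scikit-optimize roughly like this (just a sketch: the ranges, PROMPT, and GTA are placeholders, objective is the sketch above, and the actual script is attached below):

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Illustrative search space for the sampling parameters.
space = [
    Integer(1, 100, name="top_k"),
    Real(0.0, 1.0,  name="top_p"),
    Real(0.1, 1.5,  name="temperature"),
    Integer(0, 512, name="repeat_last_n"),
    Real(1.0, 1.5,  name="repeat_penalty"),
]
names = ["top_k", "top_p", "temperature", "repeat_last_n", "repeat_penalty"]

def wrapped(x):
    # gp_minimize passes a flat list of values in the order of `space`.
    return objective(dict(zip(names, x)), PROMPT, GTA)

result = gp_minimize(wrapped, space, n_calls=50, random_state=0)
print(result.x, result.fun)   # best parameters and the lowest objective found
```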
Here is the script I mentioned:
bob.py.txt
And the results I have so far (plots of the objective value over iterations and over evaluations).