Inference quality #997
-
My guess is that our sampling is just bad. I saw a Colab notebook for running inference with the original LLaMA, and it uses much more advanced sampling techniques, such as Tail Free Sampling.
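For illustration, here is a rough sketch of the idea behind Tail Free Sampling, assuming the logits have already been turned into probabilities (this is not the notebook's code, just how the technique works conceptually):

```python
import numpy as np

def tail_free_sample(probs, z=0.95, rng=np.random.default_rng()):
    """Rough sketch of Tail Free Sampling: drop the flat low-probability 'tail'
    of the sorted distribution and sample only from the head that still bends."""
    order = np.argsort(probs)[::-1]              # token ids, most likely first
    sorted_probs = probs[order]

    # Curvature of the sorted probability curve: absolute second differences,
    # normalized to sum to 1.
    curvature = np.abs(np.diff(sorted_probs, n=2))
    curvature /= curvature.sum()

    # Keep tokens until the cumulative curvature mass reaches z.
    keep = int(np.searchsorted(np.cumsum(curvature), z)) + 1
    head = sorted_probs[:keep] / sorted_probs[:keep].sum()   # renormalize the head

    return int(order[rng.choice(keep, p=head)])
```

Compared to plain top_k/top_p, the cutoff adapts to how sharply the distribution falls off rather than using a fixed token count or probability mass.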
-
So essentially, if you remove all randomness from the sampling and always pick the most likely token (minus the repetition penalty), it produces the highest-quality text, which is pretty much what you would expect. But then the generation is always identical, which defeats the purpose of sampling in the first place. If we assume that we always want some top_k > 1 and top_p > 0, then it would still be interesting to find the optimal parameters for the repetition penalty.
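To make the trade-off concrete, here is a minimal sketch of how top_k/top_p filtering and a repetition penalty typically interact (parameter names mirror the CLI flags, but this is only an illustration, not llama.cpp's actual sampler):

```python
import numpy as np

def sample_token(logits, recent_tokens, top_k=40, top_p=0.95,
                 temperature=0.8, repeat_penalty=1.1, rng=np.random.default_rng()):
    """Illustrative sampler: penalize recently used tokens, filter with
    top_k/top_p, then draw. With top_k = 1 it degenerates to greedy decoding."""
    logits = logits.astype(float)

    # Repetition penalty: push recently seen tokens down in the distribution.
    for tok in set(recent_tokens):
        logits[tok] = logits[tok] / repeat_penalty if logits[tok] > 0 else logits[tok] * repeat_penalty

    # Temperature + softmax.
    probs = np.exp((logits - logits.max()) / max(temperature, 1e-6))
    probs /= probs.sum()

    # Top-k: keep only the k most likely tokens.
    order = np.argsort(probs)[::-1][:top_k]
    head = probs[order]

    # Top-p (nucleus): keep the smallest prefix of that head reaching mass top_p.
    keep = min(int(np.searchsorted(np.cumsum(head), top_p)) + 1, len(head))
    head = head[:keep] / head[:keep].sum()

    return int(order[rng.choice(keep, p=head)])
```

With top_k = 1 the random draw never matters, so every run is identical; raising top_k and top_p trades a bit of per-token quality for variety, which is exactly the trade-off described above.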
-
The primary challenge in text-generation inference is the quality of the generated text: the more errors it contains, the less accurate the response is, and poor sampling choices directly increase the number of errors.
I have developed a script that tries to optimize the sampling parameters top_k, top_p, repeat_last_n, repeat_penalty, and temperature for the LLaMA 7B model. My "objective" metric is based on the BERTScore recall between the model's prediction and the ground-truth answer (GTA). The objective is minimized with an optimization algorithm such as a genetic algorithm or Bayesian optimization. To that end, I calculate the objective score as
score(prediction, gta) = -BERTScore_R(prediction, gta).
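As a minimal sketch of that objective (assuming the bert_score package; run_llama is a hypothetical helper that calls the llama.cpp binary with the given sampling parameters, and the real logic lives in the script attached below):

```python
from bert_score import score  # pip install bert-score

def objective(params, prompt, gta):
    """Negative BERTScore recall between the model's continuation and the
    ground-truth answer; lower is better, so any minimizer (Bayesian
    optimization, a genetic algorithm, ...) can consume it directly."""
    prediction = run_llama(prompt,                      # hypothetical wrapper around the llama.cpp binary
                           top_k=params["top_k"],
                           top_p=params["top_p"],
                           temp=params["temperature"],
                           repeat_last_n=params["repeat_last_n"],
                           repeat_penalty=params["repeat_penalty"])

    # bert_score.score returns (precision, recall, F1), one value per candidate/reference pair.
    _, recall, _ = score([prediction], [gta], lang="ru")  # the benchmark prompt is in Russian
    return -recall.item()
```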
My prompt is straightforward: a concise summary of a conversation, along with a few-shot sample of the conversation that LLaMA should logically continue.
For the Ground Truth Answer, I use a reworded summary that matches the conversation.
The prompt I use as a benchmark is written entirely in Russian, and even with different temperature settings the output contains lots of grammar errors and non-existent words.
Why did I choose BERTScore as my objective? I am not sure it is the best choice; it could just as well be some harmonic mean of BERTScore, grammar quality, and a repetition measure. I found that repetitions, when they occur, do not lower BERTScore_P but do lower BERTScore_R. I chose BERTScore recall because it reflects "how many relevant items are retrieved", whereas F1 is the harmonic mean of precision and recall; the conversation does not have to be precise, so I allow the model to be more creative.
After spending time identifying the best combination of parameters, both by subjectively ranking answers and by running Bayesian optimization, I found that the optimization tends to favor top_k = 1 and top_p = 0, which seems peculiar. As for the other parameters, with ctx_size = 1024, n_predict = 512, and ignore_eos = True, the optimizer prefers repeat_last_n = 359, repeat_penalty = 1.1876426654180257, and temperature = 0.24598246046698435. The temperature appears to matter less because of the low values chosen for top_k and top_p, but in my view it should be around 0.4.
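A Bayesian-optimization loop over these parameters can be set up with scikit-optimize roughly like this (just a sketch: the ranges, PROMPT, and GTA are placeholders, objective is the sketch above, and the actual script is attached below):

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Illustrative search space for the sampling parameters.
space = [
    Integer(1, 100, name="top_k"),
    Real(0.0, 1.0,  name="top_p"),
    Real(0.1, 1.5,  name="temperature"),
    Integer(0, 512, name="repeat_last_n"),
    Real(1.0, 1.5,  name="repeat_penalty"),
]
names = ["top_k", "top_p", "temperature", "repeat_last_n", "repeat_penalty"]

def wrapped(x):
    # gp_minimize passes a flat list of values in the order of `space`.
    return objective(dict(zip(names, x)), PROMPT, GTA)

result = gp_minimize(wrapped, space, n_calls=50, random_state=0)
print(result.x, result.fun)   # best parameters and the lowest objective found
```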
Here is the script I mentioned:
bob.py.txt
And the results I have so far (plots of the objective value over iterations and over evaluations).