Currently the README does not necessarily provide a like-for-like comparison because 4-bit quantizations can differ in quality depending on the implementation details. For example, in llama.cpp q4_0 is faster than q4_K_M but the quantization format is less efficient in terms of size. So it would be useful to include measurements of memory usage as well as a measure of output quality (e.g. perplexity on a large corpus of text) to put the speed numbers into context.
I don't know about the timeline, but llama.cpp now supports calculating the KL divergence relative to FP16, see ggerganov/llama.cpp#5076. This would be a better metric for comparison than perplexity.
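For context, here is a minimal sketch of what a KL-divergence comparison against FP16 boils down to, assuming you can dump the per-position logits of both the FP16 and the quantized model on the same token sequence. The function names are made up for illustration, and this is not the llama.cpp implementation, which may differ in details such as which positions it evaluates.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_kl_divergence(ref_logits, test_logits):
    """Mean per-position KL(P_ref || P_test) in nats.

    ref_logits:  per-position logit vectors from the reference (e.g. FP16) model
    test_logits: per-position logit vectors from the quantized model
    Both are lists of equally sized lists of floats (hypothetical input format).
    """
    if not ref_logits or len(ref_logits) != len(test_logits):
        raise ValueError("need the same, non-zero number of positions")
    total = 0.0
    for ref, test in zip(ref_logits, test_logits):
        if len(ref) != len(test):
            raise ValueError("vocabulary sizes differ")
        log_p = log_softmax(ref)
        log_q = log_softmax(test)
        # KL(P || Q) = sum_i P_i * (log P_i - log Q_i)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(ref_logits)
```

A value of 0 means the quantized model reproduces the FP16 token distribution exactly; larger values mean the distributions differ more.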
Sure. I am using the perplexity scores for a paper, so I need PPL values.
Also, how would I go about actually calculating the scores? I doubt I'll be able to run the llama.cpp script directly on MLC models to get them. I have been trying to find a way to modify mlc_chat, but no progress so far.
If you have the scripts you used on MLC-quantised models, sharing them would be a great help.
I'm trying to capture the generated logits for a prompt input, but no luck so far.
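Once the logits can be captured, the perplexity computation itself is short. The sketch below assumes `logits[i]` is the logit vector the model produced for predicting `token_ids[i]`, i.e. the logits are already shifted to line up with the next-token targets. The function name and input format are hypothetical, and how to extract the logits from MLC is not covered here.

```python
import math


def perplexity(logits, token_ids):
    """Perplexity = exp(mean negative log-likelihood of the target tokens).

    logits:    list of per-position logit vectors (lists of floats)
    token_ids: list of target token ids, token_ids[i] is predicted by logits[i]
    """
    if not logits or len(logits) != len(token_ids):
        raise ValueError("need one logit vector per target token")
    nll = 0.0
    for row, tok in zip(logits, token_ids):
        # Stable log-sum-exp so large logits do not overflow.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll -= row[tok] - log_z  # subtract log p(tok) at this position
    return math.exp(nll / len(token_ids))
```

As a sanity check, a model that outputs uniform logits over a vocabulary of size V should come out at a perplexity of V.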