4-bit KV Cache #5932
- That's huge! Hope you can implement it in llama.cpp :)
- Here are some benchmarks and information: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md
- Very interesting. What do @ggerganov, @JohannesGaessler, and @slaren think about these results? The current consensus is that a 4-bit KV cache isn't worth it because the uptick in perplexity would be too severe. However, that isn't the case with turboderp's implementation. I wonder what llama.cpp can learn from it.
- An integral part of the good performance of turboderp's KV-cache quantization is a Hadamard transform that smooths the key/value distribution before quantization; a toy illustration follows below.
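A minimal numpy sketch of that idea (not turboderp's actual kernels): rotate each key/value vector with an orthonormal Hadamard transform so outlier channels get spread across all coordinates, quantize group-wise to 4 bits, then apply the same transform again after dequantization (it is its own inverse). The group size, int4 range, and toy outlier setup are illustrative assumptions.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a
    power of two). Scaled by 1/sqrt(n) so the transform is orthonormal and
    therefore its own inverse."""
    n = x.shape[-1]
    y = x.reshape(-1, n).astype(np.float32)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[:, i:i + h].copy()
            b = y[:, i + h:i + 2 * h].copy()
            y[:, i:i + h] = a + b
            y[:, i + h:i + 2 * h] = a - b
        h *= 2
    return (y / np.sqrt(n)).reshape(x.shape)

def quantize_q4(v, group=32):
    """Symmetric 4-bit quantization with one fp16 scale per `group` values."""
    g = v.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # int4 range [-8, 7]
    scale[scale == 0] = 1.0
    return np.clip(np.round(g / scale), -8, 7).astype(np.int8), scale.astype(np.float16)

def dequantize_q4(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

# A toy key vector with a few outlier channels, as seen in real K/V activations.
rng = np.random.default_rng(0)
d = 128                                    # head dimension (power of two)
k = rng.standard_normal(d).astype(np.float32)
k[rng.integers(0, d, 4)] *= 20.0           # inject outliers

q, s = quantize_q4(k)                      # naive Q4: outliers blow up group scales
err_naive = np.abs(dequantize_q4(q, s) - k).mean()

q, s = quantize_q4(fwht(k))                # rotate first: energy spreads evenly
k_rec = fwht(dequantize_q4(q, s))          # self-inverse transform restores k
err_hadamard = np.abs(k_rec - k).mean()
print(f"naive Q4: {err_naive:.3f}  Hadamard+Q4: {err_hadamard:.3f}")
```

Because the rotation is orthonormal it preserves dot products, so rotating both the query and key leaves QK^T unchanged, which is why this kind of trick can be applied without retraining.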
- Now that #5021 is merged, I'd love to see this; we have a lot of users who are eager for a 4-bit KV cache to help save VRAM (a rough estimate of the savings follows below).
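For a sense of what's at stake, here is a back-of-the-envelope estimate of KV cache size at different precisions. The model shapes are a hypothetical 7B-class configuration, and the q8_0/q4_0 byte counts assume ggml-style blocks of 32 values with one fp16 scale each; treat the numbers as rough.

```python
# Back-of-the-envelope KV cache sizes. Shapes below are a hypothetical
# Llama-2-7B-like configuration; substitute your model's numbers.
n_layers, n_kv_heads, head_dim, n_ctx = 32, 32, 128, 8192
elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim   # the 2 covers K and V

# Bytes per element: fp16 = 2; ggml-style q8_0/q4_0 blocks store one fp16
# scale per 32 values on top of the packed integers.
fp16 = elems * 2.0
q8_0 = elems * (1.0 + 2.0 / 32)     # 34 bytes per 32-value block
q4_0 = elems * (0.5 + 2.0 / 32)     # 18 bytes per 32-value block

for name, b in [("fp16", fp16), ("q8_0", q8_0), ("q4_0", q4_0)]:
    print(f"{name}: {b / 2**30:.2f} GiB")
# fp16: 4.00 GiB, q8_0: 2.12 GiB, q4_0: 1.12 GiB
```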
- This might be useful:
- Turboderp, the developer of ExLlama V2, has made a breakthrough: a 4-bit KV cache that seemingly performs on par with FP16. In his words:
  "I'm working on some benchmarks at the moment, but they're taking a while to run. Preliminary results show the Q4 cache mode is more precise overall than FP8, and comparable to full precision. HumanEval tests are still running."
  https://www.reddit.com/r/LocalLLaMA/comments/1b9571u/80k_context_possible_with_cache_4bit/
  This is huge. Sadly, llama.cpp doesn't even have a full 8-bit cache right now (only the K cache can be quantized), so in that respect there's a lot of potential for improvement.
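To show where a quantized cache sits in the pipeline, here is a toy single-head decode step that attends over Q4-compressed K/V, reusing the hypothetical quantize_q4/dequantize_q4 helpers from the sketch above. A real kernel would dequantize blocks inside the attention matmuls rather than materializing fp32 copies.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_q4(q_vec, k_q, k_s, v_q, v_s, head_dim):
    """One single-head decode step over a Q4-quantized KV cache.
    This toy version materializes K and V in fp32 for clarity."""
    K = dequantize_q4(k_q, k_s).reshape(-1, head_dim)    # [T, head_dim]
    V = dequantize_q4(v_q, v_s).reshape(-1, head_dim)
    w = softmax(q_vec @ K.T / np.sqrt(head_dim))         # attention weights [T]
    return w @ V                                         # context vector [head_dim]

# Toy cache of T=16 past tokens with head_dim=64.
rng = np.random.default_rng(1)
T, d = 16, 64
K_f = rng.standard_normal((T, d)).astype(np.float32)
V_f = rng.standard_normal((T, d)).astype(np.float32)
k_q, k_s = quantize_q4(K_f.ravel())      # helpers from the sketch above
v_q, v_s = quantize_q4(V_f.ravel())
out = attend_q4(rng.standard_normal(d).astype(np.float32), k_q, k_s, v_q, v_s, d)
```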