Better quantized models for Mixtral-8x7b #4800
-
Sir, my quantizations keep failing and I cannot figure out how to quantize these models:
https://huggingface.co/Kquant03/MistralTrix-4x9B-MoE-ERP
https://huggingface.co/Kquant03/EarthRender-32x7B-bf16
https://huggingface.co/Kquant03/MistralTrix8x9B
https://huggingface.co/Kquant03/PsychoOrca_32x1.1B_MoE_bf16
Do you have any idea how to quantize any of these?
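For reference, the usual llama.cpp convert-then-quantize flow is roughly the sketch below; the local directory, output file names, and the `Q4_K_M` type are placeholders rather than anything confirmed in this thread, and it is this kind of flow that keeps failing for these models.

```python
# Rough sketch of the standard llama.cpp convert-then-quantize flow.
# The model directory, output names, and Q4_K_M choice are assumptions,
# not details taken from this thread.
import subprocess

model_dir = "MistralTrix-4x9B-MoE-ERP"       # local clone of the HF repo (assumed)
f16_gguf = "mistraltrix-4x9b-f16.gguf"       # intermediate full-precision GGUF
quant_gguf = "mistraltrix-4x9b-Q4_K_M.gguf"  # final quantized model

# Step 1: convert the HF checkpoint to GGUF (the conversion step is the part
# reported as breaking for these MoE merges).
subprocess.run(
    ["python", "convert-hf-to-gguf.py", model_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the resulting GGUF with llama.cpp's quantize tool.
subprocess.run(["./quantize", f16_gguf, quant_gguf, "Q4_K_M"], check=True)
```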
-
There's something busted with the HF-to-GGUF conversion as well, but I'll keep looking into other avenues for this.
-
I have published improved quantizations for Mixtral-8x7b on Huggingface.
For more details see #4364.
Note that these are for the base Mixtral-8x7b (https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), not the instruct-tuned version. I'm planning to spend some time learning how to best quantize chat/instruct-tuned models next.
The table below shows a comparison between these models and the current `llama.cpp` quantization approach, using Wikitext perplexities for a context length of 512 tokens. The "Quantization Error" columns in the table are defined as `(PPL(quantized model) - PPL(int8)) / PPL(int8)`. Running the full `fp16` Mixtral-8x7b model on the systems I have available takes too long, so I'm comparing against the 8-bit quantized model, where I get `PPL = 4.1049` (but from past experience the 8-bit quantization should be basically equivalent to `fp16`).
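For concreteness, here is a minimal sketch of how that "Quantization Error" metric is computed. Only the `PPL = 4.1049` figure comes from the comment above; the 4-bit perplexity used in the example is a made-up placeholder.

```python
# Minimal sketch of the "Quantization Error" metric described above.
def quantization_error(ppl_quantized: float, ppl_int8: float) -> float:
    """(PPL(quantized model) - PPL(int8)) / PPL(int8)"""
    return (ppl_quantized - ppl_int8) / ppl_int8

ppl_int8 = 4.1049  # Wikitext PPL of the 8-bit Mixtral-8x7b at context length 512 (from the comment)
ppl_q4 = 4.20      # hypothetical PPL for some 4-bit quantization (placeholder)

print(f"quantization error: {quantization_error(ppl_q4, ppl_int8):.2%}")
```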