Better quantized models for Mixtral-8x7b #4800
-
Sir, my quantizations keep failing and I cannot figure out how to quantize these models:
https://huggingface.co/Kquant03/MistralTrix-4x9B-MoE-ERP
https://huggingface.co/Kquant03/EarthRender-32x7B-bf16
https://huggingface.co/Kquant03/MistralTrix8x9B
https://huggingface.co/Kquant03/PsychoOrca_32x1.1B_MoE_bf16
Do you have any idea how to quantize any of these?
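For reference, the usual llama.cpp convert-then-quantize flow is roughly the sketch below; the local directory, output file names, and the `Q4_K_M` type are placeholders rather than anything confirmed in this thread, and it is this kind of flow that keeps failing for these models.

```python
# Rough sketch of the standard llama.cpp convert-then-quantize flow.
# The model directory, output names, and Q4_K_M choice are assumptions,
# not details taken from this thread.
import subprocess

model_dir = "MistralTrix-4x9B-MoE-ERP"       # local clone of the HF repo (assumed)
f16_gguf = "mistraltrix-4x9b-f16.gguf"       # intermediate full-precision GGUF
quant_gguf = "mistraltrix-4x9b-Q4_K_M.gguf"  # final quantized model

# Step 1: convert the HF checkpoint to GGUF (the conversion step is the part
# reported as breaking for these MoE merges).
subprocess.run(
    ["python", "convert-hf-to-gguf.py", model_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the resulting GGUF with llama.cpp's quantize tool.
subprocess.run(["./quantize", f16_gguf, quant_gguf, "Q4_K_M"], check=True)
```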
-
There's something busted with the HF-to-GGUF conversion as well, but I'll keep looking into other avenues for this.
-
I have published improved quantizations for Mixtral-8x7b on Huggingface.
For more details see #4364.
Note that these are for the base Mixtral-8x7b (https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), not the instruct-tuned version. I'm planning to spend some time learning how to best quantize chat/instruct-tuned models next.
The table below shows a comparison between these models and the current `llama.cpp` quantization approach, using Wikitext perplexities for a context length of 512 tokens. The "Quantization Error" columns in the table are defined as `(PPL(quantized model) - PPL(int8)) / PPL(int8)`. Running the full `fp16` Mixtral-8x7b model on the systems I have available takes too long, so I'm comparing against the 8-bit quantized model, where I get `PPL = 4.1049` (but from past experience the 8-bit quantization should be basically equivalent to `fp16`).
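For concreteness, here is a minimal sketch of how that "Quantization Error" metric is computed. Only the `PPL = 4.1049` figure comes from the comment above; the 4-bit perplexity used in the example is a made-up placeholder.

```python
# Minimal sketch of the "Quantization Error" metric described above.
def quantization_error(ppl_quantized: float, ppl_int8: float) -> float:
    """(PPL(quantized model) - PPL(int8)) / PPL(int8)"""
    return (ppl_quantized - ppl_int8) / ppl_int8

ppl_int8 = 4.1049  # Wikitext PPL of the 8-bit Mixtral-8x7b at context length 512 (from the comment)
ppl_q4 = 4.20      # hypothetical PPL for some 4-bit quantization (placeholder)

print(f"quantization error: {quantization_error(ppl_q4, ppl_int8):.2%}")
```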