🇨🇳中文 | 🌐English | 📖文档/Docs | ❓提问/Issues | 💬讨论/Discussions | ⚔️竞技场/Arena




This project is developed based on the Mixtral model released by Mistral.ai, which uses a Sparse Mixture of Experts (MoE) architecture. Large-scale unannotated Chinese data was used for incremental pre-training, resulting in the Chinese Mixtral base model, which was then fine-tuned on instruction data to create the Chinese Mixtral-Instruct instruction model. The model natively supports a 32K context (tested up to 128K) and can effectively process long texts, while also showing significant improvements in areas such as mathematical reasoning and code generation. When using llama.cpp for quantized inference, as little as 16GB of memory (or VRAM) is required.

Paper: [Cui and Yao, 2024] Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral [Blog (in Chinese)]

Main Contents of This Project

  • 🚀 Open-sourced Chinese Mixtral base model, incrementally trained in Chinese on top of Mixtral-8x7B-v0.1
  • 🚀 Open-sourced Chinese Mixtral-Instruct instruction model, further fine-tuned based on the Chinese Mixtral
  • 🚀 Open-sourced pre-training scripts and fine-tuning scripts for instructions, enabling users to further train or fine-tune the model as needed
  • 🚀 Tutorial for quick local deployment and quantization of large models using personal computer CPU/GPU
  • 🚀 Supports 🤗transformers, llama.cpp, text-generation-webui, LangChain, privateGPT, vLLM and other Mixtral ecosystem components

Chinese LLaMA-2 & Alpaca-2 Large Models | Chinese LLaMA & Alpaca Large Models | Multimodal Chinese LLaMA & Alpaca Large Models | Multimodal VLE | Chinese MiniRBT | Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge Distillation Tool TextBrewer | Model Pruning Tool TextPruner | Distillation and Pruning Integrated GRAIN

News

[Apr 30, 2024] Chinese-LLaMA-Alpaca-3 project introduces Llama-3-Chinese-8B and Llama-3-Chinese-8B-Instruct, based on Meta's Llama-3. Check: https://github.com/ymcui/Chinese-LLaMA-Alpaca-3

[Mar 27, 2024] Added 1-bit/2-bit/3-bit GGUF models: [🤗HF]. Meanwhile, this project has been added to the SOTA! model platform of Synced; welcome to follow: https://sota.jiqizhixin.com/project/chinese-mixtral

[Mar 26, 2024] Added an OpenAI-style API deployment method. See: 📚 v1.2 Release Notes

[Mar 5, 2024] Released pre-training and instruction fine-tuning scripts; the technical report is also available. See: 📚 v1.1 Release Notes

[Jan 29, 2024] 🚀 Official release of Chinese-Mixtral (base model) and Chinese-Mixtral-Instruct (instruction/chat model). For more details, see: 📚 v1.0 Release Notes

Content Guide

| Chapter | Description |
| --- | --- |
| 💁🏻‍♂️ Model Introduction | Brief introduction to the technical features of the models in this project |
| ⏬ Model Download | Download links for the Chinese Mixtral large models |
| 💻 Inference and Deployment | How to quantize the model and deploy it on a personal computer |
| 💯 Model Performance | The models' performance on selected tasks |
| 📝 Training and Fine-tuning | How to train and fine-tune the Chinese Mixtral large models |
| ❓ Frequently Asked Questions | Answers to common questions |

Model Introduction

This project open-sources the Chinese Mixtral and Chinese Mixtral-Instruct models developed based on the Mixtral model, with the following main features:

📖 Sparse Mixture of Experts Model

Mixtral is a Sparse Mixture of Experts model. This model significantly differs from mainstream large models like LLaMA in several aspects:

  • Each FFN layer contains 8 different "experts" (feed-forward sub-networks), of which the top 2 are activated per token based on gating values (see the routing sketch after this list).
  • Each token in the input sequence independently selects its experts at every layer, rather than the whole sequence being routed to a single fixed set of experts.
  • The total parameter count is about 46.7B, of which roughly 13B are active per token during inference.
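
To make the routing concrete, here is a minimal, purely illustrative sketch of top-2 expert selection in an MoE feed-forward layer (toy dimensions and a naive loop; this is not the actual Mixtral implementation):

```python
# Toy sketch of top-2 expert routing in a sparse MoE FFN layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, hidden_size=64, ffn_size=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (num_tokens, hidden_size)
        gate_logits = self.gate(x)                 # per-token gating values
        weights, expert_idx = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                # each token mixes only its top-2 experts
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Each of the 16 tokens below is routed independently; only 2 of the 8 experts run per token.
print(ToyMoELayer()(torch.randn(16, 64)).shape)    # torch.Size([16, 64])
```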

Below is a structural diagram from the Mixtral paper:



🚄 Natively Supports 32K Context (Tested up to 128K)

Unlike the models in the Chinese-LLaMA-Alpaca and Chinese-LLaMA-Alpaca-2 projects, the Mixtral model natively supports a 32K context (tested up to 128K), so a single model can handle tasks of widely varying lengths.

Model Download

Model Selection Guide

Here is a comparison of the models in this project and the recommended use cases. For chat interactions, please choose the Instruct version.

| Comparison Item | Chinese Mixtral | Chinese Mixtral-Instruct |
| --- | --- | --- |
| Model Type | Base Model | Instruction/Chat Model (akin to ChatGPT) |
| Model Size | 8x7B (about 13B activated) | 8x7B (about 13B activated) |
| Number of Experts | 8 (2 activated) | 8 (2 activated) |
| Training Type | Causal-LM (CLM) | Instruction fine-tuning |
| Training Method | QLoRA + Full emb/lm-head | QLoRA + Full emb/lm-head |
| Based on Which Model | Original Mixtral-8x7B-v0.1 | Chinese Mixtral |
| Training Corpus | Unannotated general corpus | Annotated instruction data |
| Vocabulary Size | Original vocabulary, 32000 | Original vocabulary, 32000 |
| Supported Context Length | 32K (tested up to 128K) | 32K (tested up to 128K) |
| Input Template | Not required | Mixtral-Instruct template required |
| Applicable Scenarios | Text continuation | QA, chat, etc. |

Download Links

Three different types of models are provided below:

  • Full Version Model: Can be used directly without any merging steps, recommended for users with sufficient network bandwidth.
  • LoRA Version Model: Cannot be used alone; it must be merged with the original Mixtral-8x7B-v0.1 to obtain the full model. Recommended for users with limited network bandwidth who already have the original Mixtral. For the merging method, please refer to: 💻 Model Merging Steps (a rough illustration also follows this list).
  • GGUF Version Model: A GGUF quantized version model compatible with tools like llama.cpp, recommended for users who only need to perform inference deployment.
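
As a rough illustration of the LoRA merging mentioned above (the 💻 Model Merging Steps wiki is authoritative), merging an adapter into the base model with 🤗 peft generally looks like the sketch below; all paths are placeholders:

```python
# Rough sketch of merging a LoRA adapter into the original Mixtral-8x7B-v0.1 with 🤗 peft.
# Paths are placeholders; follow the Model Merging Steps wiki for the exact procedure.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "path/to/chinese-mixtral-lora")  # downloaded LoRA
merged = model.merge_and_unload()            # fold the LoRA deltas into the base weights
merged.save_pretrained("path/to/chinese-mixtral-full")

# The vocabulary is unchanged (32000), so the original tokenizer can be reused as-is.
AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1") \
    .save_pretrained("path/to/chinese-mixtral-full")
```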
| Model Name | Type | Setting | Full Version (87 GB) | LoRA Version (2.4 GB) | GGUF Version |
| --- | --- | --- | --- | --- | --- |
| Chinese-Mixtral | Base Model | 8x7B | [Baidu] [🤗HF] [🤖ModelScope] | [Baidu] [🤗HF] [🤖ModelScope] | [🤗HF] |
| Chinese-Mixtral-Instruct | Instruction Model | 8x7B | [Baidu] [🤗HF] [🤖ModelScope] | [Baidu] [🤗HF] [🤖ModelScope] | [🤗HF] |

Note

If you are unable to access Hugging Face, consider using a mirror site (such as hf-mirror.com); please look up how to use it yourself.

Inference and Deployment

The models in this project mainly support the following quantization, inference, and deployment methods; please refer to the corresponding tutorials for details.

| Tool | Features | Tutorial |
| --- | --- | --- |
| llama.cpp | Rich quantization options and efficient local inference | [link] |
| 🤗Transformers | Native transformers inference interface | [link] |
| OpenAI-style API calls | Server demo with an OpenAI-compatible API interface | [link] |
| text-generation-webui | Front-end Web UI deployment | [link] |
| LangChain | Open-source framework for LLM applications, suitable for secondary development | [link] |
| privateGPT | Local multi-document Q&A framework | [link] |
| LM Studio | Multi-platform chat software (with a GUI) | [link] |
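
As a quick orientation (the linked tutorials are authoritative), loading the full model with 🤗 Transformers for inference might look roughly like the sketch below; the model path is a placeholder, and 4-bit loading via bitsandbytes is just one way to fit the model into limited VRAM:

```python
# Minimal 🤗 Transformers inference sketch; the model path is a placeholder and the
# 4-bit bitsandbytes loading shown here is only one option for reducing VRAM usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "path/to/Chinese-Mixtral-Instruct"      # full model from the download table
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
)

# Mixtral-Instruct prompt format; most Llama/Mistral-style tokenizers prepend <s> automatically.
prompt = "[INST] 你好，请介绍一下你自己。 [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```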

Model Performance

To evaluate the related models, this project conducted both generative evaluation and objective evaluation (NLU-style benchmarks), assessing the models from different perspectives. Users are encouraged to test the models on the tasks they care about and choose the model best suited to each task.

Generative Effect Evaluation

  • This project, inspired by Fastchat Chatbot Arena, has launched an online model battle platform to browse and evaluate the quality of model responses. The battle platform provides evaluation metrics such as win rate and Elo rating, and one can view the win rates of model matchups. ⚔️ Model Arena: http://llm-arena.ymcui.com
  • The examples directory provides output samples of Chinese-Mixtral-Instruct and Chinese-Alpaca-2-13B, and compares scores using GPT-4, with Chinese-Mixtral-Instruct averaging a score of 8.20 and Chinese-Alpaca-2-13B averaging 7.05. 📄 Output Sample Comparison: examples

Objective Effect Evaluation

C-Eval

C-Eval is a comprehensive Chinese foundation model evaluation suite, whose validation and test sets contain 1.3K and 12.3K multiple-choice questions, respectively, covering 52 subjects. For C-Eval inference code, please refer to this project's 📖GitHub Wiki.
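
For orientation, multiple-choice benchmarks of this kind are commonly scored by comparing the model's next-token probabilities for the option letters; the generic sketch below illustrates the idea (it is not this project's actual evaluation harness, which is linked in the wiki):

```python
# Generic sketch of zero-shot multiple-choice scoring with a causal LM (illustrative only;
# the project's actual C-Eval harness is linked in the GitHub Wiki).
import torch

@torch.no_grad()
def pick_choice(model, tokenizer, prompt, choices=("A", "B", "C", "D")):
    """Return the option whose letter the model considers most likely as the next token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    choice_ids = [tokenizer(c, add_special_tokens=False).input_ids[-1] for c in choices]
    return choices[int(torch.argmax(next_token_logits[choice_ids]))]

# Usage (question text is a made-up example):
# answer = pick_choice(model, tokenizer, "问题：……\nA. …\nB. …\nC. …\nD. …\n答案：")
```

Evaluation results on C-Eval: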

| Models | Type | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
| --- | --- | --- | --- | --- | --- |
| Chinese-Mixtral-Instruct | chat | 51.7 | 55.0 | 50.0 | 51.5 |
| Chinese-Mixtral | base | 45.8 | 54.2 | 43.1 | 49.1 |
| Mixtral-8x7B-Instruct-v0.1 | chat | 51.6 | 54.0 | 48.7 | 50.7 |
| Mixtral-8x7B-v0.1 | base | 47.3 | 54.6 | 46.1 | 50.3 |
| Chinese-Alpaca-2-13B | chat | 44.3 | 45.9 | 42.6 | 44.0 |
| Chinese-LLaMA-2-13B | base | 40.6 | 42.7 | 38.0 | 41.6 |

CMMLU

CMMLU is another comprehensive Chinese evaluation dataset specifically designed to assess the knowledge and reasoning ability of language models in Chinese contexts. It covers 67 topics from basic subjects to advanced professional levels, with a total of 11.5K multiple-choice questions. For CMMLU inference code, please refer to this project: 📖GitHub Wiki

| Models | Type | Test (0-shot) | Test (5-shot) |
| --- | --- | --- | --- |
| Chinese-Mixtral-Instruct | chat | 50.0 | 53.0 |
| Chinese-Mixtral | base | 42.5 | 51.0 |
| Mixtral-8x7B-Instruct-v0.1 | chat | 48.2 | 51.6 |
| Mixtral-8x7B-v0.1 | base | 44.3 | 51.6 |
| Chinese-Alpaca-2-13B | chat | 43.2 | 45.5 |
| Chinese-LLaMA-2-13B | base | 38.9 | 42.5 |

MMLU

MMLU is an English evaluation dataset for assessing natural language understanding abilities. It is one of the main datasets used today for evaluating the capabilities of large models. The validation and test sets contain 1.5K and 14.1K multiple-choice questions, respectively, covering 57 subjects. For MMLU inference code, please refer to this project: 📖GitHub Wiki

| Models | Type | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
| --- | --- | --- | --- | --- | --- |
| Chinese-Mixtral-Instruct | chat | 65.1 | 69.6 | 67.5 | 69.8 |
| Chinese-Mixtral | base | 63.2 | 67.1 | 65.5 | 68.3 |
| Mixtral-8x7B-Instruct-v0.1 | chat | 68.5 | 70.4 | 68.2 | 70.2 |
| Mixtral-8x7B-v0.1 | base | 64.9 | 69.0 | 67.0 | 69.5 |
| Chinese-Alpaca-2-13B | chat | 49.6 | 53.2 | 50.9 | 53.5 |
| Chinese-LLaMA-2-13B | base | 46.8 | 50.0 | 46.6 | 51.8 |

LongBench

LongBench is a benchmark for evaluating the long-text understanding abilities of large models. It consists of 6 categories and 20 different tasks, most of which have an average length of 5K-15K words, totaling about 4.75K test items. Below are the evaluation results of this project's model on these Chinese tasks (including coding tasks). For LongBench inference code, please refer to this project: 📖GitHub Wiki

| Models | Single-doc QA | Multi-doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Task | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chinese-Mixtral-Instruct | 50.3 | 34.2 | 16.4 | 42.0 | 56.1 | 89.5 | 48.1 |
| Chinese-Mixtral | 32.0 | 23.7 | 0.4 | 42.5 | 27.4 | 14.0 | 23.3 |
| Mixtral-8x7B-Instruct-v0.1 | 56.5 | 35.7 | 15.4 | 46.0 | 63.6 | 98.0 | 52.5 |
| Mixtral-8x7B-v0.1 | 35.5 | 9.5 | 16.4 | 46.5 | 57.2 | 83.5 | 41.4 |
| Chinese-Alpaca-2-13B-16K | 47.9 | 26.7 | 13.0 | 22.3 | 46.6 | 21.5 | 29.7 |
| Chinese-LLaMA-2-13B-16K | 36.7 | 17.7 | 3.1 | 29.8 | 13.8 | 3.0 | 17.3 |
| Chinese-Alpaca-2-7B-64K | 44.7 | 28.1 | 14.4 | 39.0 | 44.6 | 5.0 | 29.3 |
| Chinese-LLaMA-2-7B-64K | 27.2 | 16.4 | 6.5 | 33.0 | 7.8 | 5.0 | 16.0 |

Quantization Effect Evaluation

Under llama.cpp, the performance of the quantized version of the Chinese-Mixtral model was tested, as shown in the table below.

| Metric | F16 | Q8_0 | Q6_K | Q5_K | Q5_0 | Q4_K | Q4_0 | Q3_K | IQ3_XXS | Q2_K | IQ2_XS | IQ2_XXS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Size (GB) | 87.0 | 46.2 | 35.7 | 30.0 | 30.0 | 24.6 | 24.6 | 19.0 | 17.1 | 16.1 | 12.7 | 11.4 |
| BPW | 16.0 | 8.50 | 6.57 | 5.69 | 5.52 | 4.87 | 4.53 | 3.86 | 3.14 | 2.96 | 2.34 | 2.10 |
| PPL | - | 4.4076 | 4.4092 | 4.4192 | 4.4224 | 4.4488 | 4.4917 | 4.5545 | 4.5990 | 5.1846 | 6.9784 | 8.5981 |
| M3 Max Speed | - | - | 36.0 | 36.9 | 35.7 | 31.2 | 27.8 | 37.6 | - | 29.1 | - | - |
| A100 Speed | - | - | 29.9 | 22.6 | 20.5 | 21.7 | 17.1 | 21.7 | 20.6 | 20.3 | 23.7 | 22.5 |

Note

  • Model Size: in GB
  • BPW (Bits-Per-Weight): Bits per unit parameter, e.g., Q6_K has an actual average precision of 6.57 bits
  • PPL (Perplexity): Measured with a 4K context, lower values are better
  • Generation Speed: measured on Apple M3 Max (Metal) and NVIDIA A100 (40GB), in ms/token; lower values are better
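
As an illustration of running one of these GGUF quantizations from Python (the llama.cpp tutorial linked above covers the command-line workflow), a sketch using the llama-cpp-python bindings might look like this; the file name is a placeholder:

```python
# Illustrative sketch using the llama-cpp-python bindings to run a GGUF quantization.
# The GGUF file name is a placeholder; download an actual file from the links above.
from llama_cpp import Llama

llm = Llama(
    model_path="chinese-mixtral-instruct.Q4_0.gguf",  # placeholder file name
    n_ctx=4096,        # context window; the PPL figures above were measured at 4K
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available (0 = CPU only)
)
out = llm("[INST] 你好，请介绍一下你自己。 [/INST]", max_tokens=256)
print(out["choices"][0]["text"])
```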

Taking Chinese-Mixtral-Q4_0 as an example, the following figure shows how PPL changes with context length on two different sets of plain-text data. The results indicate that the context length supported by the Mixtral model exceeds the nominal 32K, and it still performs well at 64K+ (tested up to 128K).



Training and Fine-Tuning

Pre-training

  • Based on the original Mixtral model, incremental training was carried out using large-scale unlabeled data to obtain the Chinese-Mixtral base model.
  • The training data is the same unannotated data used for the base models in the Chinese-LLaMA-Alpaca project, totaling about 20GB of plain-text files.
  • Training code and tutorial: 📖 Pre-training Scripts Wiki

Instruction Fine-Tuning

  • Based on Chinese-Mixtral, further fine-tuning was done using annotated instruction data to obtain the Chinese-Mixtral-Instruct instruction model.
  • The training data is the instruction data from the Chinese-LLaMA-Alpaca-2 project, totaling about 5 million instruction samples.
  • Training code and tutorial: 📖 Instruction Fine-Tuning Scripts Wiki
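
Both stages use QLoRA with the embedding and lm_head weights trained in full (see the model comparison table above). Purely as orientation, such a setup with 🤗 peft looks roughly like the sketch below; the hyperparameters are illustrative and the wiki scripts above are authoritative:

```python
# Rough QLoRA sketch with fully trained embed_tokens/lm_head ("QLoRA + Full emb/lm-head").
# Hyperparameters are illustrative; see the project's wiki training scripts for real values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64, lora_alpha=128, lora_dropout=0.05,            # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],        # train these modules in full
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The wrapped model can then be passed to a Trainer together with the pre-training
# or instruction data described above.
```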

Instruction Template:

```
<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
```

Note: <s> and </s> are special tokens indicating the start and end of a sequence, while [INST] and [/INST] are ordinary strings.
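
For reference, a small helper that assembles a multi-turn prompt in this format could look like the sketch below (illustrative only; when using 🤗 Transformers, the tokenizer's built-in chat template can usually do this for you):

```python
# Illustrative helper that assembles a multi-turn prompt in the Mixtral-Instruct format.
# The leading <s> is normally added by the tokenizer, so only [INST] ... [/INST] and the
# </s> closing each past answer appear as plain text here.
def build_prompt(turns):
    """turns: list of (user_message, model_answer) pairs; the last answer may be None."""
    prompt = ""
    for user, answer in turns:
        prompt += f"[INST] {user} [/INST]"
        if answer is not None:
            prompt += f" {answer}</s> "
    return prompt

# One finished turn plus a new follow-up instruction:
print(build_prompt([("你好，你是谁？", "我是一个基于Mixtral的中文大模型。"),
                    ("请用一句话介绍混合专家模型。", None)]))
```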

Frequently Asked Questions

Please make sure to check the FAQ for existing solutions before raising an Issue. For specific questions and answers, refer to the project's 📖GitHub Wiki

Question 1: Will there be training with more data in the future? Will there be RLHF/DPO alignment?
Question 2: Why wasn't there an expansion of the Chinese vocabulary in this model?
Question 3: Is the downstream ecosystem of Mixtral supported?

Citation

@article{chinese-mixtral,
      title={Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral}, 
      author={Cui, Yiming and Yao, Xin},
      journal={arXiv preprint arXiv:2403.01851},
      url={https://arxiv.org/abs/2403.01851},
      year={2024}
}

Disclaimer

This project is developed based on the Mixtral model released by Mistral.ai. Please strictly adhere to Mixtral's open-source license during use, and comply with the relevant licenses of any third-party code involved. The content generated by the model may be affected by computation methods, random factors, and loss of quantization precision; therefore, this project makes no guarantees about the accuracy of model outputs and accepts no responsibility for losses arising from the use of the related resources or outputs. If the models from this project are used for commercial purposes, developers must comply with local laws and regulations and ensure the compliance of model outputs. This project bears no responsibility for any products or services derived from it.

Feedback

If you have any questions, please submit them in the GitHub Issues. Please raise issues politely to build a harmonious discussion community.

  • Before submitting an issue, please check if the FAQ can solve your problem, and it is also advisable to review past issues.
  • When submitting an issue, please use the Issue template set by this project to help quickly identify specific problems.
  • Duplicate or unrelated issues will be handled by stale-bot; thank you for your understanding.