
v0.4.0: Mixtral-8x7B, DPO-ftx, AutoGPTQ Integration

Released by @hiyouga on 16 Dec at 13:48

🚨🚨 Core refactor

  • Deprecate checkpoint_dir in favor of adapter_name_or_path (see the sketch below)
  • Replace resume_lora_training with create_new_adapter
  • Move the patches in model loading to llmtuner.model.patcher
  • Bump to Transformers 4.36.1 to support the Mixtral models
  • Broader FlashAttention-2 support (LLaMA, Falcon, Mistral)
  • Temporarily disable LongLoRA due to breaking changes; it will be supported again later

The above changes were made by @hiyouga in #1864
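
A minimal before/after sketch of the renamed arguments, written as plain Python dicts rather than the project's actual config schema; everything other than the renamed keys is a placeholder:

```python
# Hypothetical illustration of the v0.4.0 argument renames; the surrounding
# keys and values are placeholders, not the project's exact schema.

old_args = {
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "checkpoint_dir": "saves/llama2-7b/lora/sft",  # deprecated
    "resume_lora_training": False,                 # removed
}

new_args = {
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "adapter_name_or_path": "saves/llama2-7b/lora/sft",  # replaces checkpoint_dir
    "create_new_adapter": True,  # replaces resume_lora_training (roughly its inverse)
}
```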

New features

  • Add DPO-ftx: mix fine-tuning gradients into the DPO objective via the dpo_ftx argument (see the sketch after this list), as suggested by @lylcst in #1347 (comment)
  • Integrate AutoGPTQ into model export via the export_quantization_bit and export_quantization_dataset arguments (example after this list)
  • Support loading datasets from ModelScope Hub by @tastelikefeet and @wangxingjun778 in #1802
  • Support resizing token embeddings with noisy mean initialization (sketched after this list) by @hiyouga in a66186b
  • Support a system column in both the alpaca and sharegpt dataset formats
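
A rough sketch of the DPO-ftx idea, assuming dpo_ftx simply scales a supervised cross-entropy term that is added to the standard DPO loss; see the linked discussion in #1347 for the exact formulation used:

```python
import torch.nn.functional as F

def dpo_ftx_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_nll_loss, beta=0.1, dpo_ftx=1.0):
    """Sketch: DPO loss with a fine-tuning (cross-entropy) term mixed in.

    The *_logps tensors are summed log-probabilities of the chosen/rejected
    responses under the policy and reference models; chosen_nll_loss is the
    ordinary next-token cross-entropy on the chosen responses.
    """
    # Standard DPO objective (Rafailov et al., 2023).
    logits = beta * ((policy_chosen_logps - policy_rejected_logps)
                     - (ref_chosen_logps - ref_rejected_logps))
    dpo_loss = -F.logsigmoid(logits).mean()

    # dpo_ftx weights the supervised term mixed into the preference loss.
    return dpo_loss + dpo_ftx * chosen_nll_loss.mean()
```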
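
The AutoGPTQ export roughly corresponds to Transformers' GPTQ quantization path, which uses AutoGPTQ as its backend. A conceptual sketch of what export_quantization_bit and export_quantization_dataset control, with placeholder model and dataset names; the exporter's actual code path may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,        # what export_quantization_bit controls
    dataset="c4",  # calibration data, what export_quantization_dataset controls
    tokenizer=tokenizer,
)

# Loading with a GPTQConfig quantizes the weights via the AutoGPTQ backend.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)
model.save_pretrained("mixtral-8x7b-gptq-int4")
tokenizer.save_pretrained("mixtral-8x7b-gptq-int4")
```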
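
For the embedding resize, "noisy mean initialization" is read here as: new embedding rows start from the mean of the pre-existing embeddings plus a small amount of Gaussian noise, instead of random initialization. A sketch under that assumption (the noise scale is a placeholder; see commit a66186b for the actual implementation):

```python
import torch

def resize_with_noisy_mean(model, tokenizer, noise_scale=1e-3):
    """Sketch: resize token embeddings, initializing new rows with a noisy mean."""
    old_vocab_size = model.get_input_embeddings().weight.size(0)
    model.resize_token_embeddings(len(tokenizer))

    for embeddings in (model.get_input_embeddings(), model.get_output_embeddings()):
        if embeddings is None:
            continue
        with torch.no_grad():
            weight = embeddings.weight
            num_new = weight.size(0) - old_vocab_size
            if num_new > 0:
                # Mean of the original rows plus small Gaussian noise.
                mean = weight[:old_vocab_size].mean(dim=0, keepdim=True)
                noise = torch.randn(num_new, weight.size(1),
                                    dtype=weight.dtype, device=weight.device)
                weight[old_vocab_size:] = mean + noise_scale * noise
```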

New models

  • Base models
    • Mixtral-8x7B-v0.1
  • Instruct/Chat models
    • Mixtral-8x7B-Instruct-v0.1
    • Mistral-7B-Instruct-v0.2
    • XVERSE-65B-Chat
    • Yi-6B-Chat

Bug fixes