Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | Pretrain / Fine-tune |
LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | Pretrain / Fine-tune |
- LLaVA-Llama-3-8B-v1.1
  - Official LLaVA format model (xtuner/llava-llama-3-8b-v1_1-hf): 🤗 HuggingFace / 🤖 ModelScope
  - HuggingFace LLaVA format model (xtuner/llava-llama-3-8b-v1_1-transformers): 🤗 HuggingFace / 🤖 ModelScope
  - XTuner LLaVA format model (xtuner/llava-llama-3-8b-v1_1): 🤗 HuggingFace / 🤖 ModelScope
  - GGUF model (xtuner/llava-llama-3-8b-v1_1-gguf): 🤗 HuggingFace / 🤖 ModelScope
  - Pretrained projector weights: 🤗 HuggingFace / 🤖 ModelScope
- LLaVA-Llama-3-8B
  - Official LLaVA format model (xtuner/llava-llama-3-8b-hf): 🤗 HuggingFace / 🤖 ModelScope
  - HuggingFace LLaVA format model (xtuner/llava-llama-3-8b-transformers): 🤗 HuggingFace / 🤖 ModelScope
  - XTuner LLaVA format model (xtuner/llava-llama-3-8b): 🤗 HuggingFace / 🤖 ModelScope
  - Pretrained projector weights: 🤗 HuggingFace / 🤖 ModelScope
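If you would rather fetch a released checkpoint programmatically than follow the links above, the minimal sketch below uses huggingface_hub (assumed to be installed); the repo id and local directory are only examples and can be swapped for any model listed above.

```python
# Minimal sketch: download one of the released checkpoints from the HuggingFace Hub.
# Assumes `pip install huggingface_hub`; repo_id and local_dir are illustrative.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="xtuner/llava-llama-3-8b-v1_1-transformers",
    local_dir="./checkpoints/llava-llama-3-8b-v1_1-transformers",
)
print(f"Checkpoint downloaded to {local_path}")
```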
./data/llava_data
├── LLaVA-Pretrain
│ ├── blip_laion_cc_sbu_558k.json
│ ├── blip_laion_cc_sbu_558k_meta.json
│ └── images
├── LLaVA-Instruct-150K
│ └── llava_v1_5_mix665k.json
└── llava_images
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
LLaVA-Pretrain
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
- Text data
  - LLaVA-Instruct-150K
    # Make sure you have git-lfs installed (https://git-lfs.com)
    git lfs install
    git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
- Image data
  - COCO (coco): download url
  - GQA (gqa): download url
  - OCR-VQA (ocr_vqa): download script
    - ⚠️ Modify the names of OCR-VQA's images so that they keep the .jpg extension:
      #!/bin/bash
      ocr_vqa_path="<your-directory-path>"
      find "$ocr_vqa_path" -type f | while read file; do
          extension="${file##*.}"
          if [ "$extension" != "jpg" ]
          then
              cp -- "$file" "${file%.*}.jpg"
          fi
      done
  - TextVQA (textvqa): download url
  - VisualGenome (vg): download url
ShareGPT4V dataset (reference: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md)
./data/sharegpt4v
├── share-captioner_coco_lcs_sam_1246k_1107.json
├── sharegpt4v_instruct_gpt4-vision_cap100k.json
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
└── data
├── sam
│ └── images
├── share_textvqa
│ └── images
├── web-celebrity
│ └── images
├── web-landmark
│ └── images
├── wikiart
│ └── images
├── llava
│ └── llava_pretrain
│ └── images -> ../../../../llava_data/LLaVA-Pretrain/images
├── coco -> ../../llava_data/llava_images/coco
├── gqa -> ../../llava_data/llava_images/gqa
├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
├── textvqa -> ../../llava_data/llava_images/textvqa
└── vg -> ../../llava_data/llava_images/vg
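The arrows in the tree above are symlinks pointing at images that were already prepared for the LLaVA dataset. Below is a minimal Python sketch for creating them, assuming the LLaVA data lives under ./data/llava_data as described earlier; the same approach can be adapted for the internvl_sft layout shown later.

```python
# Sketch: create the symlinks shown in the tree above. Assumes the LLaVA images
# were already prepared under ./data/llava_data; targets are resolved to absolute
# paths so the links work regardless of the current working directory.
from pathlib import Path

root = Path("./data")
links = {
    "sharegpt4v/data/llava/llava_pretrain/images": "llava_data/LLaVA-Pretrain/images",
    "sharegpt4v/data/coco": "llava_data/llava_images/coco",
    "sharegpt4v/data/gqa": "llava_data/llava_images/gqa",
    "sharegpt4v/data/ocr_vqa": "llava_data/llava_images/ocr_vqa",
    "sharegpt4v/data/textvqa": "llava_data/llava_images/textvqa",
    "sharegpt4v/data/vg": "llava_data/llava_images/vg",
}
for link, target in links.items():
    link_path = root / link
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if not link_path.exists():
        link_path.symlink_to((root / target).resolve(), target_is_directory=True)
```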
- Text data
  wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
  wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/share-captioner_coco_lcs_sam_1246k_1107.json
  wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
- Image data
  - SAM (sam): download url
  - ShareTextVQA (share_textvqa): download url
  - Web-Celebrity (web-celebrity): download url
  - Web-Landmark (web-landmark): download url
  - WikiArt (wikiart): download url
  - llava, coco, gqa, ocr_vqa, textvqa, vg: please refer to the preparation of the LLaVA dataset.
InternVL-SFT dataset (reference: https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets)
./data/internvl_sft
├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
├── llava_instruct_150k_zh.jsonl
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
├── dvqa_train_200k.jsonl
├── chartqa_train_18k.jsonl
├── ai2d_train_12k.jsonl
├── docvqa_train_10k.jsonl
├── geoqa+.jsonl
├── synthdog_en.jsonl
└── data
├── ai2d
│ ├── abc_images
│ └── images
├── chartqa
│ ├── test
│ ├── train
│ └── val
├── docvqa
│ ├── test
│ ├── train
│ └── val
├── dvqa
│ └── images
├── synthdog-en
│ └── images
├── geoqa+
│ └── images
├── llava
│ └── llava_pretrain
│ └── images -> ../../../../llava_data/LLaVA-Pretrain/images
├── coco -> ../../llava_data/llava_images/coco
├── gqa -> ../../llava_data/llava_images/gqa
├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
├── textvqa -> ../../llava_data/llava_images/textvqa
├── vg -> ../../llava_data/llava_images/vg
├── sam -> ../../sharegpt4v/data/sam
├── share_textvqa -> ../../sharegpt4v/data/share_textvqa
├── web-celebrity -> ../../sharegpt4v/data/web-celebrity
├── web-landmark -> ../../sharegpt4v/data/web-landmark
└── wikiart -> ../../sharegpt4v/data/wikiart
- Text data
  wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/playground.zip
  unzip ./playground.zip
- Image data
  - AI2D (ai2d): download url
  - ChartQA (chartqa): download url
  - DVQA (dvqa): download url
  - DocVQA (docvqa): download url
  - SynthDoG-EN (synthdog-en): download url
  - GeoQA+ (geoqa+): download url
  - llava, coco, gqa, ocr_vqa, textvqa, vg: please refer to the preparation of the LLaVA dataset.
  - sam, share_textvqa, web-celebrity, web-landmark, wikiart: please refer to the preparation of the ShareGPT4V dataset.
- Pretrain (saved by default in ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/)
  NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024
- Fine-tune (saved by default in ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune/)
  NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2 --seed 1024
- Pretrain with ShareGPT4V-PT data (saved by default in ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain/)
  NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain --deepspeed deepspeed_zero2 --seed 1024
- Fine-tune with InternVL-SFT data (saved by default in ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune/)
  NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune --deepspeed deepspeed_zero2 --seed 1024
XTuner also supports single-GPU training for LLaVA-Llama-3-8B (Youth Edition); a single 20GB GPU is enough to complete the entire multi-modal training process.
- Pretrain (saved by default in ./work_dirs/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain/)
  xtuner train llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain --deepspeed deepspeed_zero2 --seed 1024
- Fine-tune (saved by default in ./work_dirs/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune/)
  xtuner train llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune --deepspeed deepspeed_zero2 --seed 1024
Step 0. Convert the .pth file to a LLaVA model in XTuner format (xtuner/llava-llama-3-8b-v1_1)
After training, we obtain a set of weights (i.e., iter_xxx.pth) that are not in the universal HuggingFace format. We first need to convert them to a LLaVA model in XTuner format.
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
At this point, we have obtained the relevant model weights (the LLM, or the corresponding LoRA). If you use the default configuration of LLaVA-Llama-3-8B, the conversion produces the following file structure, which includes the fully fine-tuned LLM weights, the projector weights, and the LoRA weights of the visual encoder.
./iter_39620_xtuner
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── projector
│ ├── config.json
│ ├── configuration_projector.py
│ ├── modeling_projector.py
│ └── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── visual_encoder_adapter
├── adapter_config.json
├── adapter_model.safetensors
└── README.md
At this point, the XTuner-format LLaVA model can be used for conversation with xtuner chat:
xtuner chat ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--image $IMAGE_PATH
and for MMBench evaluation with xtuner mmbench:
xtuner mmbench ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
Here, $DATA_PATH refers to one of the MMBench datasets, which can be downloaded with:
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
Because LoRA fine-tuning is applied to the ViT during fine-tuning, the LoRA weights must first be merged into the ViT:
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_39620_xtuner/visual_encoder_adapter ./iter_39620_visual_encoder --is-clip
- The official LLaVA format is structured similarly to the architecture of the liuhaotian/llava-v1.5-7b model.
- The HuggingFace LLaVA format is structured similarly to the architecture of the llava-hf/llava-1.5-7b-hf model.
To official LLaVA format (xtuner/llava-llama-3-8b-v1_1-hf)
We can utilize the following command to obtain the LLaVA model in the official LLaVA format.
python ./convert_xtuner_weights_to_llava.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_llava
Here, the converted LLaVA model in the official LLaVA format is saved to ./iter_39620_llava.
./iter_39620_llava
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
To HuggingFace LLaVA format (xtuner/llava-llama-3-8b-v1_1-transformers)
We can utilize the following command to obtain the LLaVA model in the HuggingFace LLaVA format.
python ./convert_xtuner_weights_to_hf.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_hf
Here, the converted LLaVA model in the HuggingFace LLaVA format is saved to ./iter_39620_hf.
./iter_39620_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
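As a quick sanity check, the converted HuggingFace-format model can be loaded with transformers. The sketch below is a minimal example, not the official usage: the image path is a placeholder and the prompt string is an assumed Llama-3-style template, so consult the released model card for the exact format.

```python
# Sketch: load the converted HuggingFace-LLaVA-format model and run a single query.
# Assumes `transformers` and `pillow` are installed; the image path and prompt
# template are placeholders/assumptions, not the officially documented format.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "./iter_39620_hf"  # or "xtuner/llava-llama-3-8b-v1_1-transformers"
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

image = Image.open("example.jpg")  # placeholder image
prompt = (  # assumed Llama-3-style chat template with an <image> placeholder
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\nDescribe this image."
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```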
LMDeploy now supports the deployment of official LLaVA format models (e.g., xtuner/llava-llama-3-8b-v1_1-hf). For details, please refer to the LMDeploy documentation.
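For reference, deploying the official-format model with LMDeploy's VLM pipeline might look like the minimal sketch below (assuming lmdeploy with vision support is installed; the image path and question are placeholders).

```python
# Sketch: run the official-LLaVA-format model through LMDeploy's VLM pipeline.
# Assumes `pip install lmdeploy`; the image path and question are placeholders.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("xtuner/llava-llama-3-8b-v1_1-hf")
image = load_image("example.jpg")  # local path or URL
response = pipe(("Describe this image.", image))
print(response.text)
```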