
LLaVA-Llama-3-8B

Results

| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
|-------|-------------------|-------------------|-------------|----------|----------|-----------|----------------|---------------------|------|-----|---------|-----|--------|---------|
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | Pretrain / Fine-tune |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | Pretrain / Fine-tune |

Resources

Data Preparation

LLaVA dataset

File structure

./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2

Pretrain

LLaVA-Pretrain

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
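
If you clone the dataset elsewhere, a minimal sketch for moving it into the layout shown above and unpacking the images (this assumes the dataset repo ships the images as images.zip; adjust if the archive layout differs):

# Place the pretraining data under ./data/llava_data and unpack the images
mkdir -p ./data/llava_data
mv ./LLaVA-Pretrain ./data/llava_data/LLaVA-Pretrain
cd ./data/llava_data/LLaVA-Pretrain
unzip images.zip -d images  # drop -d if the archive already contains a top-level images/ folder
cd -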

Finetune

  1. Text data

    1. LLaVA-Instruct-150K

      # Make sure you have git-lfs installed (https://git-lfs.com)
      git lfs install
      git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
  2. Image data (see the placement sketch after this list)

    1. COCO (coco): download url

    2. GQA (gqa): download url

    3. OCR-VQA (ocr_vqa): download script

      1. ⚠️ Make sure every OCR-VQA image keeps the .jpg extension!

        #!/bin/bash
        # Copy any image whose extension is not .jpg to a .jpg counterpart
        ocr_vqa_path="<your-directory-path>"

        find "$ocr_vqa_path" -type f | while IFS= read -r file; do
            extension="${file##*.}"
            if [ "$extension" != "jpg" ]; then
                cp -- "$file" "${file%.*}.jpg"
            fi
        done
    4. TextVQA (textvqa): download url

    5. VisualGenome (VG): part1, part2
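
After downloading, unpack each image archive into ./data/llava_data/llava_images so the tree matches the file structure shown earlier. A minimal sketch (the archive names are placeholders; substitute whatever the download links provide):

llava_images=./data/llava_data/llava_images
mkdir -p "$llava_images"/{coco,gqa,ocr_vqa,textvqa,vg}

# Unpack each download so the result matches the expected layout
unzip <coco-train2017-archive>.zip -d "$llava_images"/coco     # -> coco/train2017
unzip <gqa-images-archive>.zip     -d "$llava_images"/gqa      # -> gqa/images
unzip <textvqa-train-archive>.zip  -d "$llava_images"/textvqa  # -> textvqa/train_images
unzip <vg-part1-archive>.zip       -d "$llava_images"/vg       # -> vg/VG_100K
unzip <vg-part2-archive>.zip       -d "$llava_images"/vg       # -> vg/VG_100K_2
# OCR-VQA images are fetched by its download script directly into ocr_vqa/images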

ShareGPT4V dataset

Reference: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md

File structure

./data/sharegpt4v
├── share-captioner_coco_lcs_sam_1246k_1107.json
├── sharegpt4v_instruct_gpt4-vision_cap100k.json
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
└── data
    ├── sam
    │   └── images
    ├── share_textvqa
    │   └── images
    ├── web-celebrity
    │   └── images
    ├── web-landmark
    │   └── images
    ├── wikiart
    │   └── images
    ├── llava
    │   └── llava_pretrain
    │       └── images -> ../../../../llava_data/LLaVA-Pretrain/images
    ├── coco -> ../../llava_data/llava_images/coco
    ├── gqa -> ../../llava_data/llava_images/gqa
    ├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
    ├── textvqa -> ../../llava_data/llava_images/textvqa
    └── vg -> ../../llava_data/llava_images/vg
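
The arrows in the tree above are symbolic links that reuse the images already prepared for the LLaVA dataset. A minimal sketch for creating them, assuming the relative layout shown above:

# Create the symlinks inside ./data/sharegpt4v/data
mkdir -p ./data/sharegpt4v/data/llava/llava_pretrain
cd ./data/sharegpt4v/data
ln -s ../../../../llava_data/LLaVA-Pretrain/images llava/llava_pretrain/images
for d in coco gqa ocr_vqa textvqa vg; do
    ln -s ../../llava_data/llava_images/"$d" "$d"
done
cd -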

Download

  1. Text data

    wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
    wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/share-captioner_coco_lcs_sam_1246k_1107.json
    wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
  2. Image data

    1. SAM (sam): download url

    2. ShareTextVQA (share_textvqa): download url

    3. Web-Celebrity (web-celebrity): download url

    4. Web-Landmark (web-landmark): download url

    5. WikiArt (wikiart): download url

    6. llava, coco, gqa, ocr_vqa, textvqa, vg: Please refer to the preparation of the LLaVA dataset.

InternVL-SFT

Reference: https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets

File structure

./data/internvl_sft
├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
├── llava_instruct_150k_zh.jsonl
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
├── dvqa_train_200k.jsonl
├── chartqa_train_18k.jsonl
├── ai2d_train_12k.jsonl
├── docvqa_train_10k.jsonl
├── geoqa+.jsonl
├── synthdog_en.jsonl
└── data
    ├── ai2d
    │   ├── abc_images
    │   └── images
    ├── chartqa
    │   ├── test
    │   ├── train
    │   └── val
    ├── docvqa
    │   ├── test
    │   ├── train
    │   └── val
    ├── dvqa
    │   └── images
    ├── synthdog-en
    │   └── images
    ├── geoqa+
    │   └── images
    ├── llava
    │   └── llava_pretrain
    │       └── images -> ../../../../llava_data/LLaVA-Pretrain/images
    ├── coco -> ../../llava_data/llava_images/coco
    ├── gqa -> ../../llava_data/llava_images/gqa
    ├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
    ├── textvqa -> ../../llava_data/llava_images/textvqa
    ├── vg -> ../../llava_data/llava_images/vg
    ├── sam -> ../../sharegpt4v/data/sam
    ├── share_textvqa -> ../../sharegpt4v/data/share_textvqa
    ├── web-celebrity -> ../../sharegpt4v/data/web-celebrity
    ├── web-landmark -> ../../sharegpt4v/data/web-landmark
    └── wikiart -> ../../sharegpt4v/data/wikiart
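
As above, the symbolic links let InternVL-SFT reuse images that are already on disk. A minimal sketch for the links into the ShareGPT4V data (the llava, coco, gqa, ocr_vqa, textvqa, and vg links follow the same pattern as in the ShareGPT4V section):

# Create the symlinks inside ./data/internvl_sft/data
mkdir -p ./data/internvl_sft/data
cd ./data/internvl_sft/data
for d in sam share_textvqa web-celebrity web-landmark wikiart; do
    ln -s ../../sharegpt4v/data/"$d" "$d"
done
cd -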

Download

  1. Text data

    wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/playground.zip
    unzip ./playground.zip
  2. Image data

    1. AI2D (ai2d): download url

    2. ChartQA (chartqa): download url

    3. DocVQA (docvqa): train, val, test

    4. DVQA (dvqa): download url

    5. SynthDoG-EN (synthdog-en): download url

    6. GeoQA+ (geoqa+): download url

    7. llava, coco, gqa, ocr_vqa, textvqa, vg: Please refer to the preparation of the LLaVA dataset.

    8. sam, share_textvqa, web-celebrity, web-landmark, wikiart: Please refer to the preparation of the ShareGPT4V dataset.

Training

LLaVA-Llama-3-8B

  1. Pretrain (saved by default in ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/)
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024
  2. Fine-tune (saved by default in ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune/)
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2 --seed 1024

LLaVA-Llama-3-8B-v1.1 (Recommended)

  1. Pretrain (saved by default in ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain/)
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain --deepspeed deepspeed_zero2 --seed 1024
  2. Fine-tune (saved by default in ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune/)
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune --deepspeed deepspeed_zero2 --seed 1024

Single card?

XTuner also supports single-card training for LLaVA-Llama-3-8B (Youth Edition); a single GPU with 20GB of memory is enough to complete the entire multi-modal training process.

  1. Pretrain (saved by default in ./work_dirs/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain/)
xtuner train llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain --deepspeed deepspeed_zero2 --seed 1024
  2. Fine-tune (saved by default in ./work_dirs/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune/)
xtuner train llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune --deepspeed deepspeed_zero2 --seed 1024
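
The names passed to xtuner train above are built-in config names. To adjust hyperparameters or data paths, you can list the configs and copy one out for editing before training; a minimal sketch (XTuner typically saves the copy as <name>_copy.py):

xtuner list-cfg -p llava_llama3
xtuner copy-cfg llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune .
# Edit the copied .py file, then train from the local config
NPROC_PER_NODE=8 xtuner train ./llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune_copy.py --deepspeed deepspeed_zero2 --seed 1024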

Model Conversion (and Merge)

Step 0. Convert .pth file to LLaVA model in xtuner format (xtuner/llava-llama-3-8b-v1_1)

After training, we obtain a set of weights (i.e., iter_xxx.pth) that are not in the standard HuggingFace format. They first need to be converted to a LLaVA model in xtuner format.

xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
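
A small helper for picking up the most recent checkpoint from the default work directory before converting (this assumes checkpoints are saved as iter_*.pth under ./work_dirs/<config-name>/, as described above; with DeepSpeed they may be directories rather than single files):

FINETUNE_CFG=llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune
# -d handles checkpoints saved as directories as well as plain files
PTH_PATH=$(ls -dt ./work_dirs/${FINETUNE_CFG}/iter_*.pth | head -n 1)
xtuner convert pth_to_hf ${FINETUNE_CFG} ${PTH_PATH} ./iter_39620_xtuner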

At this point, we have the trained model (the full LLM or the corresponding LoRA adapters). With the default LLaVA-Llama-3-8B configuration, the conversion yields the following file structure, which includes the fully fine-tuned LLM weights, the projector weights, and the LoRA weights of the visual encoder.

./iter_39620_xtuner
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── projector
│   ├── config.json
│   ├── configuration_projector.py
│   ├── modeling_projector.py
│   └── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── visual_encoder_adapter
    ├── adapter_config.json
    ├── adapter_model.safetensors
    └── README.md

The xtuner-format LLaVA model can now be used for conversation with xtuner chat:

xtuner chat ./iter_39620_xtuner \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./iter_39620_xtuner \
  --prompt-template llama3_chat \
  --image $IMAGE_PATH

and for MMBench evaluation:

xtuner mmbench ./iter_39620_xtuner \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./iter_39620_xtuner \
  --prompt-template llama3_chat \
  --data-path $DATA_PATH \
  --work-dir $RESULT_PATH

Here, $DATA_PATH refers to one of the MMBench datasets, which you can download with:

wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv

Step 1. Merge ViT LoRA into the original ViT

Because the visual encoder (ViT) is fine-tuned with LoRA during the fine-tuning stage, the LoRA weights must first be merged back into the original ViT.

xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_39620_xtuner/visual_encoder_adapter ./iter_39620_visual_encoder --is-clip

Step 2. Convert LLaVA in xtuner format to official LLaVA format or HuggingFace LLaVA format

To official LLaVA format (xtuner/llava-llama-3-8b-v1_1-hf)

Use the following command to obtain the LLaVA model in the official LLaVA format.

python ./convert_xtuner_weights_to_llava.py \
  --text_model_id ./iter_39620_xtuner \
  --vision_model_id ./iter_39620_visual_encoder \
  --projector_weight ./iter_39620_xtuner/projector/model.safetensors \
  --save_path ./iter_39620_llava

Here, the converted LLaVA model in official LLaVA format is saved to ./iter_39620_llava.

./iter_39620_llava
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json

To HuggingFace LLaVA format (xtuner/llava-llama-3-8b-v1_1-transformers)

Use the following command to obtain the LLaVA model in the HuggingFace LLaVA format.

python ./convert_xtuner_weights_to_hf.py \
  --text_model_id ./iter_39620_xtuner \
  --vision_model_id ./iter_39620_visual_encoder \
  --projector_weight ./iter_39620_xtuner/projector/model.safetensors \
  --save_path ./iter_39620_hf

Here, the converted LLaVA model in HuggingFace LLaVA format is saved to ./iter_39620_hf.

./iter_39620_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json

Chat

  • XTuner LLaVA format docs
  • Official LLaVA format docs
  • HuggingFace LLaVA format docs
  • GGUF format docs

Deployment

LMDeploy now supports the deployment of official LLaVA format models (e.g., xtuner/llava-llama-3-8b-v1_1-hf). For details, please refer to here.
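
A minimal serving sketch, assuming the official-format model can be passed straight to LMDeploy's api_server; consult the linked LMDeploy documentation for the authoritative workflow and flags:

pip install lmdeploy
# Serve the official LLaVA format model behind an OpenAI-compatible HTTP endpoint
lmdeploy serve api_server xtuner/llava-llama-3-8b-v1_1-hf --server-port 23333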