# Chat-UniVi v1.5

Following LLaVA v1.5, we add grounding data and visual question-answering (VQA) data to the training set, which strengthens the model's reasoning capabilities.

## 1. Data

Download the training annotations from https://huggingface.co/datasets/Chat-UniVi/Chat-UniVi-Instruct/tree/main/v1.5_train_json. The image and video sources are listed below; a command-line sketch for fetching the annotations follows the table.

| Datasets | Baidu Disk |
| --- | --- |
| Image pretraining (from LLaVA v1.5) | Link |
| Image tuning (from LLaVA v1.5) | Link |
| Video pretraining (from Valley) | Link |
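
As an alternative to downloading through the web page, the annotations can be fetched with the huggingface_hub CLI. This is a minimal sketch, assuming the CLI is installed; `./data/annotations` is a hypothetical target directory:

```bash
# Install the CLI first if needed: pip install -U "huggingface_hub[cli]"
# Fetch only the v1.5_train_json folder from the dataset repo.
huggingface-cli download Chat-UniVi/Chat-UniVi-Instruct \
    --repo-type dataset \
    --include "v1.5_train_json/*" \
    --local-dir ./data/annotations   # hypothetical local path
```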

## 2. Train the model

### Stage 1: Multimodal Pre-training
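
In both stage commands below, `${LLM model path}`, `${stage1 save path}`, and `${stage2 save path}` are human-readable placeholders, not literal shell syntax; substitute concrete paths before running. A minimal sketch, where every path is hypothetical and the Vicuna base model is an assumption rather than something this document mandates:

```bash
# Hypothetical paths -- adapt to your setup.
LLM_MODEL_PATH=/path/to/vicuna-7b-v1.5   # base LLM weights (assumed Vicuna)
STAGE1_SAVE_PATH=./checkpoints/stage1    # pre-training output directory
STAGE2_SAVE_PATH=./checkpoints/stage2    # instruction-tuning output directory
```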

```bash
deepspeed \
    --include localhost:0,1,2,3,4,5,6,7 \
    --master_port=29602 \
    ChatUniVi/train/train_mem.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path ${LLM model path} \
    --version v1 \
    --model_use PRETUNEv1.5 \
    --dataset_use Pretrain \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ${stage1 save path} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
```
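
Stage 1 trains only the multimodal projector (`--tune_mm_mlp_adapter True` with a 2e-3 learning rate) and writes `mm_projector.bin` into the output directory, which Stage 2 then loads via `--pretrain_mm_mlp_adapter`. A quick sanity check before launching Stage 2; the checkpoint path below is hypothetical, so use whatever you passed as `--output_dir`:

```bash
# Verify that Stage 1 produced the projector weights Stage 2 expects.
STAGE1_SAVE_PATH=./checkpoints/stage1   # hypothetical; match your --output_dir
if [ -f "${STAGE1_SAVE_PATH}/mm_projector.bin" ]; then
    echo "Found mm_projector.bin -- ready for Stage 2."
else
    echo "mm_projector.bin missing -- re-check the Stage 1 run." >&2
fi
```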

### Stage 2: Joint Instruction Tuning

```bash
deepspeed \
    --include localhost:0,1,2,3,4,5,6,7 \
    --master_port=29601 \
    ChatUniVi/train/train_mem.py \
    --deepspeed scripts/zero2.json \
    --model_name_or_path ${LLM model path} \
    --version v1 \
    --model_use FINETUNE \
    --dataset_use FINETUNEv1.5 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ${stage1 save path}/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ${stage2 save path} \
    --num_train_epochs 2 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
```
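
Each stage above runs on 8 GPUs with `--per_device_train_batch_size 16` and `--gradient_accumulation_steps 1`, i.e. an effective global batch size of 16 × 8 × 1 = 128. If you have fewer GPUs, raising `--gradient_accumulation_steps` keeps that constant; a sketch of the arithmetic, assuming a hypothetical 4-GPU machine:

```bash
# Keep the effective global batch size at 128 to match the reference runs:
# per-device batch x number of GPUs x gradient accumulation steps.
PER_DEVICE=16; NUM_GPUS=4; GRAD_ACCUM=2   # assumed 4-GPU setup
echo $(( PER_DEVICE * NUM_GPUS * GRAD_ACCUM ))   # prints 128
```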

## 3. Results

### Image Understanding Benchmarks

| Methods | LLM | Visual Tokens | VQA v2 | GQA | VisWiz | SQA-I | VQA-T | POPE | MMB | LLaVA-W | MM-Vet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA v1.5 | Vicuna-7B | 576 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 64.3 | 63.4 | 30.5 |
| Video-LLaVA | Vicuna-7B | 256 | 74.7 | 60.3 | 48.1 | 66.4 | 51.8 | 84.4 | 60.9 | 73.1 | 32.0 |
| Chat-UniVi-7B v1.5 | Vicuna-7B | 112 | 75.4 | 59.6 | 44.2 | 68.1 | 53.0 | 85.4 | 62.7 | 64.3 | 28.3 |

### VideoQA

| Methods | LLM Size | MSRVTT-QA Acc. | MSRVTT-QA Score | MSVD-QA Acc. | MSVD-QA Score | TGIF-QA Acc. | TGIF-QA Score | ActivityNet-QA Acc. | ActivityNet-QA Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Video-LLaMA | 7B | 29.6 | 1.8 | 51.6 | 2.5 | - | - | 12.4 | 1.1 |
| LLaMA-Adapter | 7B | 43.8 | 2.7 | 54.9 | 3.1 | - | - | 34.2 | 2.7 |
| VideoChat | 7B | 45.0 | 2.5 | 56.3 | 2.8 | 34.4 | 2.3 | 26.5 | 2.2 |
| Video-ChatGPT | 7B | 49.3 | 2.8 | 64.9 | 3.3 | 51.4 | 3.0 | 35.2 | 2.7 |
| Video-LLaVA | 7B | 59.2 | 3.5 | 70.7 | 3.9 | 70.0 | 4.0 | 45.3 | 3.3 |
| Chat-UniVi-7B | 7B | 54.6 | 3.1 | 65.0 | 3.6 | 60.3 | 3.4 | 45.8 | 3.2 |
| Chat-UniVi-7B with new video loading code | 7B | 55.0 | 3.1 | 69.3 | 3.7 | 69.0 | 3.8 | 46.1 | 3.3 |
| Chat-UniVi-7B v1.5 | 7B | 57.5 | 3.2 | 68.8 | 3.7 | 70.0 | 3.8 | 47.2 | 3.3 |

### POPE

| Methods | LLM Size | Random Acc. | Random F1 | Random Yes (%) | Popular Acc. | Popular F1 | Popular Yes (%) | Adversarial Acc. | Adversarial F1 | Adversarial Yes (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA | 7B | 72.16 | 78.22 | 76.29 | 61.37 | 71.52 | 85.63 | 58.67 | 70.12 | 88.33 |
| Video-LLaVA | 7B | 86.2 | 85.2 | 42.0 | 85.3 | 84.0 | 42.1 | 81.6 | 80.8 | 45.8 |
| Chat-UniVi-7B | 7B | 85.19 | 86.05 | 54.67 | 69.50 | 74.39 | 69.10 | 64.97 | 71.54 | 73.10 |
| Chat-UniVi-7B v1.5 | 7B | 87.01 | 86.09 | 41.86 | 85.87 | 84.76 | 42.73 | 83.23 | 82.31 | 44.77 |