SViTT: Temporal Learning of Sparse Video-Text Transformers (CVPR 2023)

Yi Li¹, Kyle Min², Subarna Tripathi², Nuno Vasconcelos¹

¹University of California, San Diego   ²Intel Labs

Project page | Paper | 8-min video

This repository contains the PyTorch implementation of SViTT, a sparse multimodal transformer for video-language learning.

Get started

conda env create -n svitt --file environment.yml
conda activate svitt
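
If the environment was created successfully, an optional sanity check is to confirm that PyTorch and CUDA are visible before launching any jobs:

# Optional sanity check: print the PyTorch version and whether a CUDA GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"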

Data

All datasets are expected under the data/ directory with the following structure (other downstream datasets follow the same layout as MSRVTT):

data/
├── anno_pretrain/
│   └── webvid_train.json
├── anno_downstream/
│   ├── msrvtt_test1k.json
│   └── ...
├── webvid_videos/
│   └── *.mp4
├── msrvtt_videos/
│   └── *.mp4
└── ...

Raw videos should be downloaded from the websites of the respective datasets. Annotations for pre-training and downstream tasks are available in the Singularity repo; additional annotations for Charades and AGQA used in this work are available here.
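
Before launching any jobs, it can be worth confirming that the annotations and videos landed in the expected locations. A minimal check, assuming the default layout above (adjust the paths if you keep data elsewhere):

# Verify that annotation files exist and count the downloaded video clips
ls data/anno_pretrain/webvid_train.json data/anno_downstream/msrvtt_test1k.json
find data/webvid_videos -name '*.mp4' | wc -l    # WebVid clips
find data/msrvtt_videos -name '*.mp4' | wc -l    # MSRVTT clips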

Example usage

We follow the same structure for training and evaluation scripts as Singularity, with additional options for temporal modeling and sparse training.

Pre-training

To train a 4-frame SViTT model on WebVid (use arg=value to override any argument in configs/pretrain_webvid.yaml):

bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
    video_input.num_frames=4 \
    output_dir=$OUTPUT_DIR
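
For concreteness, a full invocation with placeholder values might look like the sketch below; only video_input.num_frames and output_dir are overridden here, but any other key in configs/pretrain_webvid.yaml can be appended in the same arg=value form. The GPU count and output path are placeholders, not values from the paper:

# Hypothetical example: 4 GPUs, 4-frame model, checkpoints written under exp/pt_webvid_4frm
GPUS=4
OUTPUT_DIR=exp/pt_webvid_4frm
bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
    video_input.num_frames=4 \
    output_dir=$OUTPUT_DIR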

To perform temporal sparse expansion to 8 frames:

bash scripts/pretrain.sh pt_webvid webvid $GPUS local \
    pretrained_path=$CKPT \
    video_input.num_frames=8 \
    vision_encoder_args.token_keep_rate=0.6 \
    output_dir=$OUTPUT_DIR

Downstream evaluation

It is recommended to use the same sparsity parameters (vision_encoder_args and joint_encoder_args) as the pre-trained model, though you can also override them with different values.

To evaluate zero-shot text-to-video retrieval (MSRVTT, DiDeMo):

bash scripts/eval_ret.sh $DATASET $CKPT eval-ret-$DATASET local $GPUS
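
As a concrete (hypothetical) example, the call below evaluates an 8-frame sparse checkpoint on MSRVTT with 4 GPUs. The dataset key and checkpoint path are placeholders, and the trailing token_keep_rate override assumes the eval script accepts the same arg=value overrides as pre-training; check scripts/eval_ret.sh to confirm:

# Hypothetical invocation; dataset key and checkpoint path are placeholders
DATASET=msrvtt
CKPT=exp/pt_webvid_8frm/ckpt_best.pth
bash scripts/eval_ret.sh $DATASET $CKPT eval-ret-$DATASET local 4 \
    vision_encoder_args.token_keep_rate=0.6    # assumed override syntax; match the pre-trained sparsity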

To fine-tune text-to-video retrieval (Charades, SSv2):

bash scripts/train_ret.sh $DATASET $CKPT train-ret-$DATASET local $GPUS

To fine-tune video question answering (MSRVTT-QA, ActivityNet-QA, AGQA):

bash scripts/train_qa.sh $DATASET $CKPT train-qa-$DATASET local $GPUS
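
As with retrieval, a concrete call might look like the following sketch; the dataset key and checkpoint path are placeholders, so check scripts/train_qa.sh for the exact dataset names it expects:

# Hypothetical invocation: fine-tune video QA from an 8-frame pre-trained checkpoint on 4 GPUs
DATASET=msrvtt_qa                        # placeholder; see scripts/train_qa.sh for accepted names
CKPT=exp/pt_webvid_8frm/ckpt_best.pth    # placeholder checkpoint path
bash scripts/train_qa.sh $DATASET $CKPT train-qa-$DATASET local 4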

Acknowledgements

This project is built primarily on top of the awesome Singularity codebase. We also acknowledge the use of several other open-source repositories, including Frozen in Time, ALBEF, and 🤗 Transformers. This work was funded in part by NSF award IIS-2041009.

Citation

If you find this repo useful, please cite our work. Thanks!

@inproceedings{li2023svitt,
  title={{SViTT}: Temporal Learning of Sparse Video-Text Transformers},
  author={Li, Yi and Min, Kyle and Tripathi, Subarna and Vasconcelos, Nuno},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18919--18929},
  year={2023}
}
