DeVIS: Making Deformable Transformers Work for Video Instance Segmentation

This repository provides the official implementation of the paper DeVIS: Making Deformable Transformers Work for Video Instance Segmentation by Adrià Caelles, Tim Meinhardt, Guillem Brasó and Laura Leal-Taixé. The codebase builds upon Deformable DETR, VisTR and TrackFormer.

Abstract

Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences. In the past, VIS methods mirrored the fragmentation of these subtasks in their architectural design, hence missing out on a joint solution. Transformers recently allowed to cast the entire VIS task as a single set-prediction problem. Nevertheless, the quadratic complexity of existing Transformer-based methods requires long training times, high memory requirements, and processing of low-single-scale feature maps. Deformable attention provides a more efficient alternative but its application to the temporal domain or the segmentation task have not yet been explored. In this work, we present Deformable VIS (DeVIS), a VIS method which capitalizes on the efficiency and performance of deformable Transformers. To reason about all VIS subtasks jointly over multiple frames, we present temporal multi-scale deformable attention with instance-aware object queries. We further introduce a new image and video instance mask head with multi-scale features, and perform near-online video processing with multi-cue clip tracking. DeVIS reduces memory as well as training time requirements, and achieves state-of-the-art results on the YouTube-VIS 2021, as well as the challenging OVIS dataset.

Results

Results are grouped by evaluation benchmark.

COCO

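| Model | Backbone | box AP | mask AP | AP50 | AP75 | APl | APm | APs | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mask R-CNN | R50 | 41.0 | 37.2 | 58.5 | 39.8 | 53.3 | 39.4 | 18.6 | 21.4 |
| Mask2Former | R50 | - | 43.7 | - | - | 64.8 | 47.2 | 23.4 | 13.5 |
| Ours | R50 | 46.3 | 38.0 | 61.4 | 40.1 | 59.8 | 41.4 | 17.9 | 12.1 |
| Mask R-CNN | R101 | 42.9 | 38.6 | 60.4 | 41.3 | 55.3 | 41.3 | 19.4 | - |
| Mask2Former | R101 | - | 44.2 | - | - | 67.7 | 47.7 | 23.8 | - |
| Ours | R101 | 47.9 | 39.9 | 63.0 | 42.1 | 61.5 | 43.9 | 19.9 | - |
| Mask2Former | SwinL | - | 50.1 | - | - | 72.1 | 53.9 | 31.0 | - |
| Ours | SwinL | 54.6 | 45.2 | 61.4 | 40.1 | 59.8 | 41.4 | 17.9 | - |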

YouTube-VIS-2019

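| Model | Backbone | AP | AP50 | AP75 | AR1 | AR10 | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VisTR | R50 | 36.2 | 59.8 | 36.9 | 37.2 | 42.4 | 69.9 |
| IFC | R50 | 41.2 | 65.1 | 44.6 | 42.3 | 49.6 | 107.1 |
| SeqFormer | R50 | 45.1 | 66.9 | 50.5 | 45.6 | 54.6 | - |
| Mask2Former | R50 | 46.4 | 68.0 | 50.0 | - | - | - |
| Ours (T=6, S=4) | R50 | 44.4 | 67.9 | 48.6 | 42.4 | 51.6 | 18.4 |
| SeqFormer | SwinL | 59.3 | 82.1 | 66.4 | 51.7 | 64.4 | - |
| Mask2Former | SwinL | 60.4 | 84.4 | 67.0 | - | - | - |
| Ours (T=6, S=4) | SwinL | 57.1 | 80.8 | 66.3 | 50.8 | 61.0 | - |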

YouTube-VIS-2021

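| Model | Backbone | AP | AP50 | AP75 | AR1 | AR10 |
| --- | --- | --- | --- | --- | --- | --- |
| IFC | R50 | 35.2 | 57.2 | 37.5 | - | - |
| SeqFormer | R50 | 40.5 | 62.4 | 43.7 | 36.1 | 48.1 |
| Mask2Former | R50 | 40.6 | 60.9 | 41.8 | - | - |
| Ours (T=6, S=4) | R50 | 43.1 | 66.8 | 46.6 | 38.0 | 50.1 |
| SeqFormer | SwinL | 51.8 | 74.6 | 58.2 | 42.8 | 58.1 |
| Mask2Former | SwinL | 52.6 | 76.4 | 57.2 | - | - |
| Ours (T=6, S=4) | SwinL | 54.4 | 77.7 | 59.8 | 43.8 | 57.8 |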

OVIS

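| Model | Backbone | AP | AP50 | AP75 | AR1 | AR10 |
| --- | --- | --- | --- | --- | --- | --- |
| CrossVis | R50 | 14.9 | 32.7 | 12.1 | 10.3 | 19.8 |
| TeViT | R50 | 17.4 | 34.9 | 15.0 | 11.2 | 21.8 |
| Ours (T=6, S=4) | R50 | 23.7 | 47.6 | 20.8 | 12.0 | 28.9 |
| Ours (T=6, S=4) | SwinL | 35.5 | 59.3 | 38.3 | 16.6 | 39.8 |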
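
The video results are labelled T=6, S=4, which this README does not define. Assuming T is the number of frames per clip and S is the stride between consecutive clips (an interpretation, not confirmed here), the sketch below shows how a video could be divided into overlapping sub-clips. The function name `split_into_clips` and the handling of the shorter final clip are hypothetical and are not taken from this codebase.

```python
def split_into_clips(num_frames, clip_length=6, stride=4):
    """Return frame-index lists for overlapping clips that cover the video.

    Assumption: clip_length plays the role of T and stride the role of S.
    The last clip is simply truncated at the end of the video.
    """
    if clip_length < 1 or stride < 1:
        raise ValueError("clip_length and stride must be positive")
    if stride > clip_length:
        raise ValueError("stride larger than clip_length would skip frames")
    if num_frames <= 0:
        return []

    clips = []
    start = 0
    while True:
        end = min(start + clip_length, num_frames)
        clips.append(list(range(start, end)))
        if end >= num_frames:
            break
        start += stride
    return clips


# A 10-frame video with T=6, S=4 yields two clips sharing frames 4 and 5.
for clip in split_into_clips(10):
    print(clip)
```

Under this reading, consecutive clips share T - S = 2 frames, which is the kind of overlap a clip tracker can use to link instances across clips.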

Configuration

Our configuration system is based on YACS (similar to detectron2). We hope this makes it easier for the research community to build upon our method. Refer to src/config.py for an overview of all available configuration options, including how the model is built and the training and test options. All default config values correspond to the Deformable DETR + iterative bounding box refinement model, which makes it easier to see the changes we have introduced on top of it. Config values that are unique to DeVIS are set to those of the YT-19 model. We use uppercase words (e.g. MODEL.NUM_QUERIES) to refer to config parameters.
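
To illustrate how dotted uppercase keys such as MODEL.NUM_QUERIES address nested options, here is a minimal standard-library sketch of applying KEY VALUE overrides on top of defaults. It is a simplified stand-in for what a YACS-style config does, not the code in src/config.py. The `DEFAULTS` dictionary, its values, and the helper names are placeholders chosen for illustration.

```python
import ast
from copy import deepcopy

# Placeholder defaults for illustration; the real options live in src/config.py.
DEFAULTS = {
    "MODEL": {"NUM_QUERIES": 300, "WEIGHTS": ""},
}


def parse_value(raw):
    """Interpret a command-line string as a Python literal when possible."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # plain strings such as paths are kept unchanged


def merge_overrides(defaults, opts):
    """Apply ['SECTION.KEY', 'value', ...] pairs on top of nested defaults."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    cfg = deepcopy(defaults)  # leave the defaults untouched
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(f"unknown config option: {key}")
        node[leaf] = parse_value(raw)
    return cfg


cfg = merge_overrides(
    DEFAULTS,
    ["MODEL.WEIGHTS", "/path/to/checkpoint", "MODEL.NUM_QUERIES", "100"],
)
print(cfg["MODEL"])
```

Rejecting unknown keys means a misspelled option fails loudly instead of being silently ignored.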

Install

See docs/INSTALL.md for detailed installation instructions.

Train

See docs/TRAIN.md for detailed training instructions.

Evaluate

To evaluate a model, add the --eval-only argument and set MODEL.WEIGHTS to the checkpoint path on the command line. For example, the following command produces YT-19 val predictions:

python main.py --config-file configs/devis/YT-19/devis_R_50_YT-19.yaml --eval-only MODEL.WEIGHTS /path/to/yt-19_checkpoint_file

Multi-GPU evaluation is also supported: launch with torchrun and set --nproc_per_node to the desired number of GPUs.

torchrun --nproc_per_node=4 main.py --config-file configs/devis/YT-19/devis_R_50_YT-19.yaml --eval-only MODEL.WEIGHTS /path/to/yt-19_checkpoint_file

Furthermore, several checkpoints can be validated once training has finished by pointing TEST.INPUT_FOLDER to the training output directory and setting TEST.EPOCHS_TO_EVAL to the epochs you want to validate.
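
When evaluations are launched from a script, the two commands above can be assembled programmatically. The helper below is a small sketch: `build_eval_command` and its `overrides` parameter are hypothetical names introduced here, and the value format expected by options such as TEST.EPOCHS_TO_EVAL is not specified in this README, so check src/config.py before passing them.

```python
import shlex


def build_eval_command(config_file, weights, num_gpus=1, overrides=None):
    """Return the argv list for an evaluation run, mirroring the commands above."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # A single GPU uses plain python; several GPUs are launched through torchrun.
    launcher = ["python"] if num_gpus == 1 else ["torchrun", f"--nproc_per_node={num_gpus}"]
    argv = launcher + [
        "main.py", "--config-file", config_file,
        "--eval-only", "MODEL.WEIGHTS", weights,
    ]
    # Extra KEY VALUE config overrides are appended after MODEL.WEIGHTS.
    for key, value in (overrides or {}).items():
        argv += [key, str(value)]
    return argv


cmd = build_eval_command(
    "configs/devis/YT-19/devis_R_50_YT-19.yaml",
    "/path/to/yt-19_checkpoint_file",
    num_gpus=4,
)
print(shlex.join(cmd))
```

With num_gpus=4 this prints the same torchrun command shown above. The resulting list can be passed directly to `subprocess.run`.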

Visualize results

When TEST.VIZ.OUT_VIZ_PATH=path/to/save is specified, visualizations of the results in the .json file are saved to that path. Additionally, TEST.VIZ.SAVE_CLIP_VIZ saves the results of the individual sub-clips (before clip tracking is applied). Finally, TEST.VIZ.SAVE_MERGED_TRACKS=True plots all tracks on the same image, as in the figures of the paper.

We provide an additional config file that adjusts thresholds to produce more visually appealing results and sets TEST.VIZ.VIDEO_NAMES so that inference runs only on the specified videos. The following command shows how to get visual results for the YT-21 val set:

python main.py --config-file configs/devis/devis_R_50_visualization_YT-21.yaml --eval-only MODEL.WEIGHTS /path/to/yt-21_checkpoint_file

To generate the video, change into the output folder containing the images and run:

ffmpeg -framerate 5 -pattern_type glob -i '*.jpg' -c:v libx264 -pix_fmt yuv420p out.mp4

Attention maps

We also provide an additional script, visualize_att_maps.py, to generate attention maps. We recommend using the visualization config file mentioned above. The script lets you choose the decoder layer and whether to merge resolutions (see args_parse() for details).

python visualize_att_maps.py --config-file configs/devis/devis_R_50_visualization_YT-21.yaml --merge-resolution 1 MODEL.WEIGHTS /path/to/yt-21_checkpoint_file

Publication

If you use this software in your research, please cite our publication:

@article{devis,
  author = {Caelles, Adrià and Meinhardt, Tim and Brasó, Guillem and Leal-Taixé, Laura},
  title = {{DeVIS: Making Deformable Transformers Work for Video Instance Segmentation}},
  journal = {arXiv:2207.11103},
  year = {2022},
}
