UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

This is the official PyTorch implementation and models for UP-DETR paper and the extended version:

@ARTICLE{9926201,
  author={Dai, Zhigang and Cai, Bolun and Lin, Yugeng and Chen, Junying},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Unsupervised Pre-Training for Detection Transformers}, 
  year={2022},
  volume={},
  number={},
  pages={1-11},
  doi={10.1109/TPAMI.2022.3216514}}

@InProceedings{Dai_2021_CVPR,
    author    = {Dai, Zhigang and Cai, Bolun and Lin, Yugeng and Chen, Junying},
    title     = {UP-DETR: Unsupervised Pre-Training for Object Detection With Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {1601-1610}
}

In UP-DETR, we introduce a novel pretext named random query patch detection to pre-train transformers for object detection. UP-DETR inherits from DETR with the same ResNet-50 backbone, same Transformer encoder, decoder and same codebase. With unsupervised pre-training CNN, the whole UP-DETR pre-training doesn't require any human annotations. UP-DETR achieves 43.1 AP(even higher) on COCO with 300 epochs fine-tuning. The AP of open-source version is a little higher than paper report.

Model Zoo

We provide pre-training UP-DETR and fine-tuning UP-DETR models on COCO, and plan to include more in future. The evaluation metric is same to DETR.

Here is the UP-DETR model pre-trained on ImageNet without labels. The CNN weight is initialized from SwAV, which is fixed during the transformer pre-training:

name	backbone	epochs	url	size	md5
UP-DETR	R50 (SwAV)	60	model \| logs	164Mb	`49f01f8b`

The result of UP-DETR fine-tuned on COCO:

name	backbone (pre-train)	epochs	box AP	AP_S	AP_M	AP_L	url
DETR	R50 (Supervised)	500	42.0	20.5	45.8	61.1	-
DETR	R50 (SwAV)	300	42.1	19.7	46.3	60.9	-
UP-DETR	R50 (SwAV)	300	43.1	21.6	46.8	62.4	model \| logs

COCO val5k evaluation results of UP-DETR can be found in this gist.

Usage - Object Detection

There are no extra compiled components in UP-DETR and package dependencies are same to DETR. We provide instructions how to install dependencies via conda:

git clone tbd
conda install -c pytorch pytorch torchvision
conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

UP-DETR follows two steps: pre-training and fine-tuning. We present the model pre-trained on ImageNet and then fine-tuned on COCO.

Unsupervised Pre-training

Data Preparation

Download and extract ILSVRC2012 train dataset.

We expect the directory structure to be the following:

path/to/imagenet/
  n06785654/  # caterogey directory
    n06785654_16140.JPEG # images
  n04584207/  # caterogey directory
    n04584207_14322.JPEG # images

Images can be organized disorderly because our pre-training is unsupervised.

Pre-training

To pr-train UP-DETR on a single node with 8 gpus for 60 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --lr_drop 40 \
    --epochs 60 \
    --pre_norm \
    --num_patches 10 \
    --batch_size 32 \
    --feature_recon \
    --fre_cnn \
    --imagenet_path path/to/imagenet \
    --output_dir path/to/save_model

As the size of pre-training images is relative small, so we can set a large batch size.

It takes about 2 hours for a epoch, so 60 epochs pre-training takes about 5 days with 8 V100 gpus.

In our further ablation experiment, we found that object query shuffle is not helpful. So, we remove it in the open-source version.

Fine-tuning

Data Preparation

Download and extract COCO 2017 dataset train and val dataset.

The directory structure is expected as follows:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images

Fine-tuning

To fine-tune UP-DETR with 8 gpus for 300 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env detr_main.py \
    --lr_drop 200 \
    --epochs 300 \
    --lr_backbone 5e-5 \
    --pre_norm \
    --coco_path path/to/coco \
    --pretrain path/to/save_model/checkpoint.pth

The fine-tuning cost is exactly same to DETR, which takes 28 minutes with 8 V100 gpus. So, 300 epochs training takes about 6 days.

The model can also extended to panoptic segmentation, checking more details on DETR.

Evaluation

python detr_main.py \
    --batch_size 2 \
    --eval \
    --no_aux_loss \
    --pre_norm \
    --coco_path path/to/coco \
    --resume path/to/save_model/checkpoint.pth

COCO val5k evaluation results of UP-DETR can be found in this gist.

Notebook

We provide a notebook in colab to get the visualization result in the paper:

Visualization Notebook: This notebook shows how to perform query patch detection with the pre-training model (without any annotations fine-tuning).

License

UP-DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github		.github
datasets		datasets
models		models
util		util
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
detr_main.py		detr_main.py
engine.py		engine.py
main.py		main.py
requirements.txt		requirements.txt
visualization.ipynb		visualization.ipynb

License

dddzg/up-detr

Folders and files

Latest commit

History

Repository files navigation

UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

Model Zoo

Usage - Object Detection

Unsupervised Pre-training

Data Preparation

Pre-training

Fine-tuning

Data Preparation

Fine-tuning

Evaluation

Notebook

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages