Code for paper "Exploiting Unlabeled Data with Vision and Language Models for Object Detection"

Abstract: Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a novel state-of-the-art for open-vocabulary object detection.

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Official implementation of Exploiting unlabeled data with vision and language models for object detection.



Our project is developed on Detectron2. Please follow the official installation instructions.

Data Preparation

Download the COCO dataset, and put it in the datasets/ directory.

Download our pre-generated pseudo-labeled data, and put them in the datasets/open_voc directory.
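Once both downloads are in place, a quick check along the following lines can confirm that the two directories exist. This is only a sketch: the helper name `missing_dirs` is hypothetical and not part of this repo, and only the `datasets/` and `datasets/open_voc` paths come from the instructions above. The files expected inside those directories are not checked.

```python
from pathlib import Path

# Directories named in the data preparation steps. The exact files inside
# them are not assumed here, so only the directories themselves are checked.
EXPECTED_DIRS = ("datasets", "datasets/open_voc")


def missing_dirs(repo_root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Dataset directories found.")
```

Run it from the repository root before starting evaluation or training.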

The datasets are organized in the following way:


Note: You may generate and evaluate pseudo labels on your own by following our pseudo label generation instructions.
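For intuition about what pseudo label generation involves, the sketch below scores class-agnostic region proposals against class-name text embeddings and keeps only the confident matches. It is a simplified, CLIP-style similarity scoring swapped in for illustration, not the pipeline implemented in this repo. The function name, the `threshold` and `temperature` values, and the assumption that region and text embeddings are already computed are all hypothetical.

```python
import numpy as np


def assign_pseudo_labels(region_emb, text_emb, class_names,
                         threshold=0.8, temperature=0.01):
    """Label each region with its best-matching class name.

    region_emb: (R, D) array of region embeddings (assumed precomputed).
    text_emb:   (C, D) array of class-name text embeddings (assumed precomputed).
    Returns a list of (region_index, class_name, score) for confident regions.
    Zero-norm embeddings are not handled in this sketch.
    """
    region_emb = np.asarray(region_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    # L2-normalize so the dot product below is a cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature
    # Softmax over classes, shifted for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    best = probs.argmax(axis=1)
    scores = probs[np.arange(len(best)), best]
    # Keep only regions whose top class probability clears the threshold.
    return [(i, class_names[c], float(s))
            for i, (c, s) in enumerate(zip(best, scores))
            if s >= threshold]


if __name__ == "__main__":
    regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
    texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    for idx, name, score in assign_pseudo_labels(regions, texts, ["cat", "dog"]):
        print(idx, name, round(score, 3))
```

In the toy input, the third region is equally similar to both class names, so its top probability is 0.5 and it is dropped rather than given a label.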

Evaluation with pre-trained models

Mask R-CNN:

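| Training Method | Novel AP | Base AP | Overall AP | Download |
|-----------------|----------|---------|------------|----------|
| With LSJ        | 34.4     | 60.2    | 53.5       | model    |
| W/O LSJ         | 32.3     | 54.0    | 48.3       | model    |
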
python -m --config configs/coco_openvoc_LSJ.yaml  --num-gpus=1 --eval-only --resume
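When reading the AP columns in the table above, it can help to see how the three numbers relate. The sketch below assumes that Overall AP is the mean over all categories, and that the COCO open-vocabulary split has 17 novel and 48 base categories. Neither assumption is stated in this README, but the reported numbers are consistent with both.

```python
def overall_ap(novel_ap, base_ap, num_novel=17, num_base=48):
    """Class-count-weighted mean of novel and base AP.

    The 17/48 category counts are an assumption about the benchmark split.
    """
    total = num_novel + num_base
    return (num_novel * novel_ap + num_base * base_ap) / total


print(round(overall_ap(34.4, 60.2), 1))  # With LSJ -> 53.5
print(round(overall_ap(32.3, 54.0), 1))  # W/O LSJ  -> 48.3
```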


The best model on COCO in the paper is trained with large scale jitter (LSJ), but training with LSJ requires a large amount of GPU memory. Thus, besides the LSJ version, we also provide a training configuration without LSJ.
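As a rough illustration of what large scale jitter does to an input image, the sketch below samples a random scale, resizes the image to fit a scaled canvas, and reports the fixed canvas it would then be cropped or padded to. The 0.1 to 2.0 scale range and the 1024 x 1024 canvas follow the commonly used LSJ recipe and are assumptions here. The function name is hypothetical, and the values actually used by `configs/coco_openvoc_LSJ.yaml` may differ.

```python
import random


def lsj_resize_shape(height, width, target=(1024, 1024),
                     scale_range=(0.1, 2.0), rng=random):
    """Sample a jitter scale and return (scale, resized_shape, canvas_shape)."""
    scale = rng.uniform(*scale_range)
    # Scale the target canvas, then fit the image inside it while keeping
    # the aspect ratio.
    ratio = min(target[0] * scale / height, target[1] * scale / width)
    new_h = max(1, round(height * ratio))
    new_w = max(1, round(width * ratio))
    # After resizing, the image is cropped (if larger) or padded (if smaller)
    # to the fixed target canvas, so every training image has the same shape.
    return scale, (new_h, new_w), target


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        scale, resized, canvas = lsj_resize_shape(480, 640, rng=rng)
        print(f"scale={scale:.2f} resized={resized} canvas={canvas}")
```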

Training Mask R-CNN with Large Scale Jitter (LSJ).

python --config configs/coco_openvoc_LSJ.yaml  --num-gpus=8 --use_lsj

Training Mask R-CNN without Large Scale Jitter (LSJ).

python --config configs/coco_openvoc_mask_rcnn.yaml  --num-gpus=8

Citing VL-PLM

If you use VL-PLM in your work or wish to refer to the results published in this repo, please cite our paper:


Aug 19, 2022