Code for the paper "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training"

Abstract: Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, this has employed separate encoders for each modality. However, recent work suggests that transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of parameters shared along a spectrum. In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that light-weight modality-specific parallel modules further improve performance. Experimental results show that the proposed MS-CLIP approach outperforms vanilla CLIP by up to 13% relative in zero-shot ImageNet classification (pre-trained on YFCC-100M), while simultaneously supporting a reduction of parameters. In addition, our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks. Furthermore, we discover that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transferring of common semantic structure (e.g., attention patterns) from language to vision. Code is available in this repository.

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP)

This repo contains the source code of our ECCV 2022 paper MS-CLIP:

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
European Conference on Computer Vision (ECCV 2022)
By Haoxuan You*, Luowei Zhou*, Bin Xiao*, Noel Codella*, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan.

Introduction

Figure: MS-CLIP

We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of parameters shared along a spectrum. In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that lightweight modality-specific parallel modules further improve performance.
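
The sketch below illustrates the core idea in PyTorch: a single transformer block whose attention and MLP weights are reused for both image patch tokens and text tokens, while small modality-specific pieces (per-modality LayerNorms and a lightweight parallel adapter here) stay separate. This is a minimal illustration under assumed names and sizes, not the implementation in this repo.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One transformer block whose attention/MLP weights are shared across
    modalities, with per-modality LayerNorms and lightweight parallel adapters.
    Illustrative sketch only; names do not match the repo's implementation."""

    def __init__(self, dim=512, heads=8, adapter_dim=64):
        super().__init__()
        # Shared parameters (used for both image and text tokens)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Modality-specific parameters (kept separate per modality)
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("image", "text")})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("image", "text")})
        self.adapter = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, adapter_dim), nn.GELU(), nn.Linear(adapter_dim, dim))
            for m in ("image", "text")
        })

    def forward(self, x, modality):
        h = self.norm1[modality](x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2[modality](x)
        # shared MLP plus a light modality-specific parallel module
        x = x + self.mlp(h) + self.adapter[modality](h)
        return x

block = SharedBlock()
image_tokens = torch.randn(2, 50, 512)   # e.g. ViT patch tokens
text_tokens = torch.randn(2, 77, 512)    # e.g. text token embeddings
img_out = block(image_tokens, "image")
txt_out = block(text_tokens, "text")
```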

Figure: MS-CLIP-S

Update

  • [07/20/2022] Released pretrained models and zero-shot evaluation on ImageNet-1k.

Pre-trained Weights

| Model | Training Set | Zero-shot Top-1 on IN-1K (%) | LP* on 24 datasets (%) | Download |
|---|---|---|---|---|
| MS-CLIP-S (ViT-B/32) | YFCC-22M | 36.7 | 68.5 | ckpt/config |
| MS-CLIP-S (ViT-B/16) | YFCC-22M | 39.0 | 70.4 | ckpt/config |
| MS-CLIP-S (ViT-B/32) | LAION-20M | 40.2 | 73.3 | ckpt/config |

*LP: Linear Probing
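
For reference, linear probing trains only a linear classifier on top of frozen image features; the encoder itself is never updated. A generic sketch (assuming a CLIP-style `encode_image` interface and scikit-learn, not this repo's evaluation code) looks like this:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    """Collect frozen image features and labels; `model.encode_image` is an
    assumed CLIP-style interface, not necessarily this repo's exact API."""
    feats, labels = [], []
    for images, targets in loader:
        f = model.encode_image(images.to(device))
        feats.append(f.cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Train a linear classifier on frozen features of a downstream dataset:
# X_train, y_train = extract_features(model, train_loader)
# X_test, y_test = extract_features(model, test_loader)
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# print("linear-probe accuracy:", clf.score(X_test, y_test))
```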

Getting Started

Installation

Please follow INSTALL.md for installation.

Data preparation

Please follow DATA.md for data preparation.

Pre-trained weights preparation

Download the pre-trained weights via the links in the table above and put them under ./OUTPUT_MODEL/.

Evaluation

To evaluate a pre-trained MS-CLIP-S on ImageNet Zero-shot Classification, run:

CUDA_VISIBLE_DEVICES=0 python tools/eval_zeroshot.py --model <config-file> 

where <config-file> is a config YAML file under experiments/model/, e.g., experiments/model/b32-laion-msclips.yaml.
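
For intuition, the snippet below sketches what zero-shot classification does: embed a text prompt for every class, embed the image, and pick the class with the highest cosine similarity. The `encode_image`/`encode_text`/`tokenize` interface is an assumed CLIP-style API for illustration; `tools/eval_zeroshot.py` is the actual entry point.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, images, class_names, tokenize, device="cuda"):
    """Minimal sketch of zero-shot evaluation: encode a prompt per class,
    then assign each image the class whose text embedding is most similar.
    The model/tokenizer interface is assumed, not this repo's exact API."""
    prompts = [f"a photo of a {name}." for name in class_names]
    text_feat = model.encode_text(tokenize(prompts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    image_feat = model.encode_image(images.to(device))
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    logits = image_feat @ text_feat.t()   # cosine similarities
    return logits.argmax(dim=-1)          # predicted class index per image
```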

Contact

If you have any questions, please contact Haoxuan You or Luowei Zhou.
