MagicLens

Self-Supervised Image Retrieval with Open-Ended Instructions

1Google DeepMind, 2The Ohio State University
*Work done at Google DeepMind.
Correspondence to: zhang.13253@osu.edu

Figure: MagicLens overview.


We present MagicLens, a series of image retrieval models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, a single MagicLens model achieves results comparable to or better than prior state-of-the-art (SOTA) methods on 10 benchmarks spanning multimodality-to-image, image-to-image, and text-to-image retrieval tasks. MagicLens can also satisfy diverse search intents expressed by open-ended instructions.

🔥 We are working on releasing MagicLens models and inference code, stay tuned! 🔥

MagicLens Models

Data Construction

We mine naturally occurring image pairs from the same webpages, which implicitly cover diverse image relations. We utilize large multimodal models and large language models to construct 36.7M high-quality triplets (query image, text instruction, target image) for model training.
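The pair-mining step can be illustrated with a minimal sketch. It assumes the crawl is available as (page URL, image ID) records, and that every ordered pair of distinct images on one page is a candidate; both are assumptions for illustration, not details confirmed by the text above. The names mine_image_pairs, to_triplet, and write_instruction are hypothetical, and write_instruction merely stands in for the large multimodal and language models that produce the instruction text.

```python
from collections import defaultdict
from itertools import permutations


def mine_image_pairs(records):
    """Group images by source webpage and emit ordered (query, target) pairs.

    `records` is an iterable of (page_url, image_id) tuples (assumed format).
    """
    by_page = defaultdict(list)
    for page_url, image_id in records:
        # Skip images that appear more than once on the same page.
        if image_id not in by_page[page_url]:
            by_page[page_url].append(image_id)

    pairs = []
    for page_url, images in by_page.items():
        # Pages with a single image yield no pairs.
        for query, target in permutations(images, 2):
            pairs.append((page_url, query, target))
    return pairs


def to_triplet(query, target, write_instruction):
    """Build a (query image, instruction, target image) triplet.

    `write_instruction` is a hypothetical callable standing in for the
    model-generated instruction that links the two images.
    """
    return (query, write_instruction(query, target), target)


records = [
    ("https://example.com/a", "img_1"),
    ("https://example.com/a", "img_2"),
    ("https://example.com/b", "img_3"),
]
pairs = mine_image_pairs(records)
triplet = to_triplet("img_1", "img_2", lambda q, t: f"find {t} given {q}")
```

In practice a pipeline of this kind would also filter low-quality or unrelated pairs before generating instructions; that filtering is omitted here.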

Figure: Data construction.

Model Training

MagicLens is built upon single-modality encoders initialized from CLIP or CoCa and trained with a simple contrastive loss. With a dual-encoder architecture, MagicLens takes both image and text inputs and produces a vision-language (VL) embedding, enabling multimodality-to-image and image-to-image retrieval. The underlying single-modality encoders can also be reused for text-to-image retrieval, with non-trivial performance gains.
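To make the training objective concrete, here is a minimal sketch of a generic in-batch contrastive loss over query and target embeddings, followed by nearest-neighbor retrieval with the same embeddings. It is not necessarily the exact loss used for MagicLens: the temperature value, the use of cosine similarity, and the function names contrastive_loss and retrieve are assumptions made for illustration. Embeddings are plain lists of floats and must be non-zero.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query_embs, target_embs, temperature=0.07):
    """Mean cross-entropy where target i is the positive for query i
    and every other target in the batch acts as a negative."""
    queries = [l2_normalize(q) for q in query_embs]
    targets = [l2_normalize(t) for t in target_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def retrieve(query_emb, index_embs, k=1):
    """Return indices of the k index embeddings most similar to the query."""
    q = l2_normalize(query_emb)
    scores = [dot(q, l2_normalize(e)) for e in index_embs]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Matched pairs give a lower loss than mismatched pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
top = retrieve([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]], k=1)
```

In this setup the query embedding would come from the multimodal side (image plus instruction) and the target embedding from the image alone, so one similarity search serves every retrieval task described above.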

Figure: MagicLens training.

Experiment Results

Retrieval Examples on 1.4M Held-Out Images


Multimodality-to-Image Retrieval

Parameter Efficiency

Figure: Parameter vs. performance.


Image-to-Image Retrieval


Text-to-Image Retrieval


        title={MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions},
        author={Kai Zhang and Yi Luan and Hexiang Hu and Kenton Lee and Siyuan Qiao and Wenhu Chen and Yu Su and Ming-Wei Chang},
        journal={arXiv preprint arXiv:2403.19651},