Subject-driven Text-to-Image Generation via

Apprenticeship Learning

(*: Core Contribution)
1 Google DeepMind
2 Google Research

SuTI is an in-context subject-driven text-to-image generator that draws subject-specific images without fine-tuning

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject by fine-tuning an 'expert model' for that subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine-tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We use these clusters to train a massive number of expert models, each specialized in a different subject. The apprentice model SuTI then learns to mimic the behavior of these experts through the proposed apprenticeship learning algorithm. SuTI can generate high-quality, customized, subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing approaches like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, and Re-Imagen, while performing on par with DreamBooth.
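The pipeline above can be sketched as a toy distillation loop. This is a heavily simplified illustration, not the actual SuTI implementation: the real experts and apprentice are large diffusion models, whereas the functions and string "images" below are placeholders we invented to show the data flow (per-subject expert → expert outputs → one apprentice training set).

```python
def train_expert(cluster_images, caption):
    """Stand-in for fine-tuning a subject-specific expert: here it just
    maps a caption to a fake rendered image of its subject."""
    return {caption: f"render({cluster_images[0]}, {caption})"}

def apprenticeship_learning(clusters, captions):
    """Collect (demonstrations, caption, expert output) triples from many
    per-subject experts; a single apprentice is then trained on them.

    The apprentice receives the demonstration images in-context, so at
    inference time no subject-specific fine-tuning is needed."""
    dataset = []
    for subject, images in clusters.items():
        for caption in captions:
            expert = train_expert(images, caption)
            dataset.append((tuple(images), caption, expert[caption]))
    # A real apprentice is a diffusion model trained on this dataset;
    # we return the dataset itself to show its shape.
    return dataset

clusters = {"dog-a": ["dog_a.jpg"], "vase-b": ["vase_b.jpg"]}
data = apprenticeship_learning(clusters, ["in a forest", "on a beach"])
print(len(data))  # 2 subjects x 2 captions = 4 training triples
```

Note the key design point this sketch mirrors: the expensive per-subject optimization happens once, offline, inside the expert loop; the apprentice amortizes all of it into a single model.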

Zero-shot subject-specific generations on DreamBench-v2

Given a few demonstrations from the user, SuTI is able to generate subject-specific images instantly (~20-30 seconds on TPUv4), without fine-tuning.

SuTI behaves adaptively as the # of in-context examples increases

When # of in-context examples = 0, SuTI generates from its text-only prior;
When # of in-context examples = 1, SuTI behaves similarly to an image editing model;
When # of in-context examples > 1, SuTI generates novel poses/views of the subject.
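The three regimes above can be summarized in a small dispatch sketch. This is purely illustrative: the function, field names, and mode labels are ours, not part of SuTI's API; the point is only that the same model adapts its behavior based on how many demonstration images accompany the prompt.

```python
def build_conditioning(prompt, demo_images):
    """Hypothetical helper: pair a text prompt with in-context
    demonstration images and note which regime the count falls into."""
    if len(demo_images) == 0:
        mode = "text-only prior"      # pure text-to-image generation
    elif len(demo_images) == 1:
        mode = "image editing"        # behaves like an editing model
    else:
        mode = "novel poses/views"    # re-renders the subject in new scenes
    return {"prompt": prompt, "demos": list(demo_images), "mode": mode}

print(build_conditioning("a dog on a beach", [])["mode"])
# -> text-only prior
```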

SuTI not only re-contextualizes subject images but can also compose multiple transformations together

SuTI has a set of diverse skills in generating high-quality subject-specific images

In addition to re-contextualization, SuTI can accessorize, stylize, colorize, and synthesize novel views of subject-specific images

Failure modes of SuTI

SuTI often fails when the subject's appearance deviates too much from its prior, or when the text depicts a rare visual scene

Cite us using the following BibTeX

      
      @article{chen2023suti,
        title={Subject-driven Text-to-Image Generation via Apprenticeship Learning}, 
        author={Chen, Wenhu and Hu, Hexiang and Li, Yandong and Ruiz, Nataniel
                and Jia, Xuhui and Chang, Ming-Wei and Cohen, William W},
        journal={arXiv preprint arXiv:2304.00186},
        year={2023},
      }
      
    

Special Thanks

We thank Boqing Gong, Kaifeng Chen, Steve Seitz, Nicole Brichtova, Andrew Bunner and Jason Baldridge for their in-depth reviews of an early version of this paper and for their valuable comments and suggestions. We thank Yang Zhao and Shiran Zada for their tremendous help and support in running baselines on the proposed DreamBench-v2 dataset. We also thank Kenton Lee for discussions and feedback on the project. Additionally, we thank the OVEN and OPERA teams for providing their website template.