Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Yang Chen¹,²*

Hexiang Hu¹

Yi Luan¹

Haitian Sun¹

Soravit Changpinyo¹

Alan Ritter²

Ming-Wei Chang¹

(*: Work done while the author was an intern at Google)

Large language models have demonstrated an emergent capability in answering knowledge-intensive questions. With recent progress on web-scale visual and language pre-training, do these models also understand how to answer visual information-seeking questions? To answer this question, we present InfoSeek, a Visual Question Answering dataset that focuses on information-seeking questions that cannot be answered with common sense knowledge alone. We perform a multi-stage human annotation to collect a natural distribution of high-quality visual information-seeking question-answer pairs. We also construct a large-scale, automatically collected dataset by combining existing visual entity recognition datasets and Wikidata, which provides over one million examples for model fine-tuning and validation. Based on InfoSeek, we analyze various pre-trained Visual QA systems to gain insights into the characteristics of different pre-trained models.

InfoSeek: A New VQA Benchmark Focused on Visual Info-Seeking Questions

Special Thanks

We thank Jialin Wu and Luowei Zhou for reviewing an early version of this paper. We thank Xi Chen for providing different variants of PaLI pre-trained checkpoints. We also thank Radu Soricut, Anelia Angelova, Andre Araujo, and Vittorio Ferrari for valuable discussions and feedback on the project. We thank Huiwen Chang and the Muse Team for providing their website template. Yang Chen is partially funded by the NSF (IIS-2052498).