Bridging the Lexical Gap: Generative Text-to-Image Retrieval for Parts-of-Speech Imbalance in Vision-Language Models

Hyesu Hwang, Daeun Kim, Jaehui Park, Yongjin Kwon

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Retrieving relevant images based on text is challenging due to the non-trivial nature of aligning vision and language representations. Large-scale vision-language models such as CLIP are widely used in recent studies to leverage their pre-trained knowledge of this alignment. However, our observations reveal a 60.8% performance decrease for verb, adjective, and adverb queries compared with noun queries. In preliminary studies, we found that popular vision-language models exhibit insufficient image-text alignment for specific parts of speech. We also observed that nouns strongly influence the text-to-image retrieval results of vision-language models. Based on these findings, this paper proposes a method that generates noun-based queries as part of query rewriting. First, a large language model extracts nouns relevant to the initial query and generates a hypothetical query that best matches the parts-of-speech alignment in the vision-language model. Then, we verify whether the hypothetical query preserves the original intent of the query and iteratively rewrite it. Our experiments show that our method can significantly enhance text-to-image retrieval performance, and they highlight the role of lexical knowledge in vision-language models.
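The abstract's rewriting loop (extract nouns, generate a noun-centric hypothetical query, check intent preservation, iterate) can be sketched as below. This is a minimal illustration, not the authors' implementation: `extract_nouns`, `generate_hypothetical_query`, and `preserves_intent` are hypothetical stand-ins for the LLM calls the paper describes, replaced here with toy logic so the control flow is runnable.

```python
# Illustrative sketch of iterative noun-based query rewriting.
# All three helpers are hypothetical stand-ins for LLM calls;
# their toy logic exists only to make the loop executable.

def extract_nouns(query: str) -> list[str]:
    # Stand-in for an LLM noun extractor: match against a toy lexicon.
    noun_lexicon = {"dog", "frisbee", "park", "grass"}
    return [w for w in query.lower().split() if w.strip(".,") in noun_lexicon]

def generate_hypothetical_query(nouns: list[str]) -> str:
    # Stand-in for an LLM that composes a noun-centric query.
    return "a photo of " + " and ".join(nouns)

def preserves_intent(original: str, rewritten: str) -> bool:
    # Stand-in for an LLM intent check: require some lexical overlap.
    return bool(set(original.lower().split()) & set(rewritten.lower().split()))

def rewrite_query(query: str, max_iters: int = 3) -> str:
    """Iteratively rewrite `query` into a noun-based hypothetical query."""
    candidate = query
    for _ in range(max_iters):
        nouns = extract_nouns(candidate)
        if not nouns:
            break
        candidate = generate_hypothetical_query(nouns)
        if preserves_intent(query, candidate):
            return candidate
    return query  # fall back to the original query if no rewrite passes

print(rewrite_query("A dog joyfully catching a frisbee in the park"))
# → a photo of dog and frisbee and park
```

In the paper's setting, the rewritten noun-based query would then be encoded by the vision-language model's text encoder in place of the original query; the toy lexicon and overlap check above are assumptions made purely for demonstration.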

Original language: English
Title of host publication: MMGR 2024 - Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 25-33
Number of pages: 9
ISBN (Electronic): 9798400712029
DOIs
State: Published - 28 Oct 2024
Event: 2nd International Workshop on Deep Multimodal Generation and Retrieval, MMGR 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: MMGR 2024 - Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval

Conference

Conference: 2nd International Workshop on Deep Multimodal Generation and Retrieval, MMGR 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • Generative Retrieval
  • Large Language Model
  • Text-to-Image Retrieval
  • Vision-Language Model
