TY - GEN
T1 - Bridging the Lexical Gap
T2 - 2nd International Workshop on Deep Multimodal Generation and Retrieval, MMGR 2024
AU - Hwang, Hyesu
AU - Kim, Daeun
AU - Park, Jaehui
AU - Kwon, Yongjin
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/10/28
Y1 - 2024/10/28
N2 - Retrieving relevant images from text is challenging because aligning vision and language representations is non-trivial. Recent studies widely use large-scale vision-language models such as CLIP to leverage their pre-trained knowledge of this alignment. However, our observations reveal a 60.8% performance decrease for verb, adjective, and adverb queries compared to noun queries. Through preliminary studies, we found that popular vision-language models insufficiently align images and text for specific parts of speech. We also observed that nouns strongly influence the text-to-image retrieval results of vision-language models. Based on these observations, this paper proposes a method that generates noun-based queries by rewriting the initial query. First, a large language model extracts nouns relevant to the initial query and generates a hypothetical query that best matches the part-of-speech alignment in the vision-language model. Then, we verify whether the hypothetical query preserves the intent of the original query and iteratively rewrite it. Our experiments show that our method significantly enhances text-to-image retrieval performance and highlights the understanding of lexical knowledge in vision-language models.
AB - Retrieving relevant images from text is challenging because aligning vision and language representations is non-trivial. Recent studies widely use large-scale vision-language models such as CLIP to leverage their pre-trained knowledge of this alignment. However, our observations reveal a 60.8% performance decrease for verb, adjective, and adverb queries compared to noun queries. Through preliminary studies, we found that popular vision-language models insufficiently align images and text for specific parts of speech. We also observed that nouns strongly influence the text-to-image retrieval results of vision-language models. Based on these observations, this paper proposes a method that generates noun-based queries by rewriting the initial query. First, a large language model extracts nouns relevant to the initial query and generates a hypothetical query that best matches the part-of-speech alignment in the vision-language model. Then, we verify whether the hypothetical query preserves the intent of the original query and iteratively rewrite it. Our experiments show that our method significantly enhances text-to-image retrieval performance and highlights the understanding of lexical knowledge in vision-language models.
KW - Generative Retrieval
KW - Large Language Model
KW - Text-to-Image Retrieval
KW - Vision-Language Model
UR - http://www.scopus.com/inward/record.url?scp=85210825101&partnerID=8YFLogxK
U2 - 10.1145/3689091.3690089
DO - 10.1145/3689091.3690089
M3 - Conference contribution
AN - SCOPUS:85210825101
T3 - MMGR 2024 - Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval
SP - 25
EP - 33
BT - MMGR 2024 - Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -