Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining

Qin Chao, Eunsoo Kim, Boyang Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Investments in movie production carry a high level of risk, as movie revenues have long-tailed and bimodal distributions [1]. Accurate prediction of box-office revenue may mitigate this uncertainty and encourage investment. However, learning effective representations for actors, directors, and user-generated content-related keywords remains a challenging open problem. In this work, we investigate the effects of self-supervised pretraining and propose visual grounding of content keywords in objects from movie posters as a pretraining objective. Experiments on a large dataset of 35,794 movies demonstrate significant benefits of self-supervised training and visual grounding. In particular, visual grounding pretraining substantially improves learning on movies with content keywords and achieves a 14.5% relative performance gain over a finetuned BERT model with an identical architecture.
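The abstract does not specify how the grounding objective is implemented. As a minimal sketch, assuming an InfoNCE-style contrastive formulation, one way to ground keyword embeddings in poster-object features is to treat each (keyword, detected-object) pair from the same movie as a positive and all other objects in the batch as negatives; the function names and shapes below are illustrative, not from the paper:

```python
import numpy as np

def grounding_loss(keyword_emb, object_emb, temperature=0.1):
    """Contrastive (InfoNCE-style) grounding loss: row i of keyword_emb
    is assumed to match row i of object_emb (the positive pair); all
    other rows in the batch serve as negatives. Hypothetical sketch."""
    # L2-normalize so dot products are cosine similarities
    k = keyword_emb / np.linalg.norm(keyword_emb, axis=1, keepdims=True)
    o = object_emb / np.linalg.norm(object_emb, axis=1, keepdims=True)
    logits = (k @ o.T) / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # log-softmax over objects; positives lie on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Illustrative usage with random stand-in embeddings
rng = np.random.default_rng(0)
keywords = rng.normal(size=(8, 16))   # 8 keyword embeddings, dim 16
objects = rng.normal(size=(8, 16))    # 8 poster-object embeddings
loss = grounding_loss(keywords, objects)
```

Minimizing such a loss pulls each keyword embedding toward the visual features of the matching poster object, which is one plausible reading of "visual grounding of content keywords in objects from movie posters" as a pretraining objective.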

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Publisher: IEEE Computer Society
Pages: 1535-1540
Number of pages: 6
ISBN (Electronic): 9781665468916
DOIs
State: Published - 2023
Event: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023 - Brisbane, Australia
Duration: 10 Jul 2023 - 14 Jul 2023

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2023-July
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Country/Territory: Australia
City: Brisbane
Period: 10/07/23 - 14/07/23

Keywords

  • Box Office Prediction
  • Movie Revenue Prediction
  • Multimodal Learning
  • Self-supervised Learning
  • Visual Grounding

