Abstract
Movie production and investment carry a high level of risk, which motivates machine learning research on predicting box-office revenue. Identifying the variables that significantly influence box-office revenue can also aid human decision-making. In this study, we collect a large movie dataset that includes user-generated keywords and movie posters, and we integrate these modalities to better predict box-office revenue. We use visual information from movie posters to visually ground the movie keywords, thereby obtaining more semantically precise text representations, which yields a substantial 14.5% improvement in box-office prediction accuracy. We also develop metrics that quantify content similarity based on the keywords, facilitating the identification of “copycat movies,” a term that can be extended beyond traditional sequels and franchise movies. We then analyze the importance of copycat features in box-office revenue prediction using two explanatory methods: Attention Rollout and LIME. Our analyses show that copycat features are important for box-office prediction and reveal a positive relationship between copycat movies and box-office revenue. However, this effect diminishes as the number of similar movies and the similarity of their content increase. Overall, our work establishes a comprehensive process for predicting movie box-office revenue that uses multi-modal data and provides valuable business insights.
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Multimedia |
| DOIs | |
| State | Accepted/In press - 2026 |
Keywords
- box-office prediction
- content similarity
- copycat movies
- model interpretability
- movie keywords
- movie posters
- visually grounded textual representation
Fingerprint
Dive into the research topics of 'Copycat vs. Original: Multi-modal Pretraining and Variable Importance in Box-office Prediction'. Together they form a unique fingerprint.