A study of evaluation metrics and datasets for video captioning

Jaehui Park, Chibon Song, Ji Hyeong Han

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

13 Scopus citations

Abstract

With the fast-growing interest in deep learning, various applications and machine learning tasks have emerged in recent years. Video captioning in particular is gaining a lot of attention from both the computer vision and natural language processing fields. Captions are usually generated by jointly learning from different data modalities that share common themes in a video. Learning joint representations of different modalities is very challenging due to the inherent heterogeneity of the mixed information: visual scenes, speech dialogs, music, sounds, and so on. Consequently, it is hard to evaluate the quality of video captioning results. In this paper, we introduce well-known metrics and datasets for the evaluation of video captioning. We compare the existing metrics and datasets to derive a new research proposal for the evaluation of video descriptions.
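As context for the metrics the paper surveys, the snippet below is a minimal sketch of how one widely used captioning metric, BLEU, can be computed with NLTK. The example captions are hypothetical, and the paper itself does not prescribe this implementation; it only illustrates how a generated caption is scored against reference captions.

```python
# Hypothetical example: scoring a generated video caption against
# reference captions with BLEU, one of the standard metrics surveyed
# in the paper. Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference captions (e.g., human annotations for one video clip).
references = [
    "a man is playing a guitar on stage".split(),
    "someone plays guitar in front of an audience".split(),
]

# Caption produced by a captioning model (hypothetical output).
candidate = "a man plays a guitar".split()

# Smoothing avoids zero scores when higher-order n-gram matches are
# absent, which is common for short video captions.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform weights over 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```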

Original language: English
Title of host publication: ICIIBMS 2017 - 2nd International Conference on Intelligent Informatics and Biomedical Sciences
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 172-175
Number of pages: 4
ISBN (Electronic): 9781509066643
DOIs
State: Published - 2 Jul 2017
Event: 2nd International Conference on Intelligent Informatics and Biomedical Sciences, ICIIBMS 2017 - Okinawa, Japan
Duration: 24 Nov 2017 → 26 Nov 2017

Publication series

Name: ICIIBMS 2017 - 2nd International Conference on Intelligent Informatics and Biomedical Sciences
Volume: 2018-January

Conference

Conference: 2nd International Conference on Intelligent Informatics and Biomedical Sciences, ICIIBMS 2017
Country/Territory: Japan
City: Okinawa
Period: 24/11/17 → 26/11/17

Keywords

  • Benchmark datasets
  • Movie captioning
  • Video captioning
  • Video to text
