Video multitask transformer network

Hongje Seong, Junhyuk Hyun, Euntai Kim

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

25 Scopus citations

Abstract

In this paper, we propose the Multitask Transformer Network for multitasking on untrimmed video. Analyzing untrimmed video requires capturing the important frames and regions in the spatio-temporal domain. We therefore use the Transformer Network, which can capture useful features from CNN representations through an attention mechanism. Motivated by the Action Transformer Network, a model that repurposes the Transformer for video, we modify its concept of the query, which was specialized for action recognition on trimmed video, to fit untrimmed video. In addition, we change the structure of the Transformer unit to a pre-activation structure so that the residual connections act as identity mappings. We also use the class conversion matrix (CCM), a feature fusion method, to share information between tasks. Combining our Transformer structure with the CCM yields the Multitask Transformer Network for multitasking on untrimmed video. We evaluated our model on CoVieW 2019 and improved its performance through post-processing of the prediction results, tailored to the CoVieW 2019 evaluation metric. In the CoVieW 2019 challenge, we placed fourth in the final ranking and first on the scene and action scores.
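
The abstract does not give the exact layout of the modified Transformer unit, so the following is only a generic sketch of what a pre-activation residual unit means, not the authors' implementation. It assumes layer normalization as the "activation" step and single-head self-attention without learned projections; the function names and the `scale` parameter are hypothetical and exist only to make the identity-mapping property visible.

```python
import math

def layer_norm(vec, eps=1e-5):
    # Normalize one feature vector to zero mean and unit variance.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention over lists of feature vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def pre_activation_unit(x, scale=1.0):
    # y = x + scale * Attention(LN(x)): normalization is applied inside the
    # residual branch, so the skip path carries x through unchanged.
    normed = [layer_norm(row) for row in x]
    attended = attention(normed, normed, normed)
    return [[xi + scale * ai for xi, ai in zip(xr, ar)]
            for xr, ar in zip(x, attended)]

def post_activation_unit(x, scale=1.0):
    # y = LN(x + scale * Attention(x)): normalization sits on the skip path,
    # so the input is altered even when the residual branch contributes nothing.
    attended = attention(x, x, x)
    return [layer_norm([xi + scale * ai for xi, ai in zip(xr, ar)])
            for xr, ar in zip(x, attended)]

# Hypothetical toy input: two "frames" with three features each.
frames = [[1.0, 2.0, 3.0], [4.0, 0.0, -1.0]]
```

Setting the hypothetical `scale` to zero switches the residual branch off: `pre_activation_unit(frames, scale=0.0)` returns the input unchanged, while `post_activation_unit(frames, scale=0.0)` returns a normalized copy. That difference is what "identity mapping on residual connections" refers to.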

Original language: English
Title of host publication: Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1553-1561
Number of pages: 9
ISBN (Electronic): 9781728150239
State: Published - Oct 2019
Event: 17th IEEE/CVF International Conference on Computer Vision Workshop, ICCVW 2019 - Seoul, Korea, Republic of
Duration: 27 Oct 2019 to 28 Oct 2019

Publication series

Name: Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019

Conference

Conference: 17th IEEE/CVF International Conference on Computer Vision Workshop, ICCVW 2019
Country/Territory: Korea, Republic of
City: Seoul
Period: 27/10/19 to 28/10/19

Keywords

  • Action recognition
  • Multitasking
  • Scene recognition
  • Untrimmed video
