TY - CONF
T1 - Align-and-Attend Network for Globally and Locally Coherent Video Inpainting
AU - Woo, Sanghyun
AU - Kim, Dahun
AU - Park, Kwanyong
AU - Lee, Joon Young
AU - Kweon, In So
N1 - Publisher Copyright:
© 2020. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.
PY - 2020
Y1 - 2020
N2 - Video inpainting is more challenging than image inpainting because of the extra temporal dimension. It requires inpainted content to be globally coherent in both space and time. A natural solution is to aggregate features from other frames, and existing state-of-the-art methods therefore rely heavily on 3D convolution or optical flow. However, these methods place more emphasis on temporally nearby frames, so long-term temporal information is not sufficiently exploited. In this work, we propose a novel two-stage alignment method. The first stage is an alignment module that uses homographies computed between the target frame and the reference frames. Visible patches are then aggregated based on frame similarity to roughly fill in the target holes. The second stage is an attention module that matches the generated patches with known reference patches in a non-local manner, refining the output of the global alignment stage. Both stages use a large spatio-temporal reference window, enabling the model to capture long-range correlations between distant information and the hole regions. The proposed model can even handle challenging scenes with large or slowly moving holes, which existing approaches have struggled to model. Experiments on video object removal demonstrate that our method significantly outperforms previous state-of-the-art learning approaches.
AB - Video inpainting is more challenging than image inpainting because of the extra temporal dimension. It requires inpainted content to be globally coherent in both space and time. A natural solution is to aggregate features from other frames, and existing state-of-the-art methods therefore rely heavily on 3D convolution or optical flow. However, these methods place more emphasis on temporally nearby frames, so long-term temporal information is not sufficiently exploited. In this work, we propose a novel two-stage alignment method. The first stage is an alignment module that uses homographies computed between the target frame and the reference frames. Visible patches are then aggregated based on frame similarity to roughly fill in the target holes. The second stage is an attention module that matches the generated patches with known reference patches in a non-local manner, refining the output of the global alignment stage. Both stages use a large spatio-temporal reference window, enabling the model to capture long-range correlations between distant information and the hole regions. The proposed model can even handle challenging scenes with large or slowly moving holes, which existing approaches have struggled to model. Experiments on video object removal demonstrate that our method significantly outperforms previous state-of-the-art learning approaches.
UR - http://www.scopus.com/inward/record.url?scp=85139578210&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85139578210
T2 - 31st British Machine Vision Conference, BMVC 2020
Y2 - 7 September 2020 through 10 September 2020
ER -