TY - JOUR
T1 - Video Object Segmentation Using Kernelized Memory Network With Multiple Kernels
AU - Seong, Hongje
AU - Hyun, Junhyuk
AU - Kim, Euntai
N1 - Publisher Copyright: © 1979-2012 IEEE.
PY - 2023/2/1
Y1 - 2023/2/1
N2 - Semi-supervised video object segmentation (VOS) is to predict the segment of a target object in a video when a ground truth segmentation mask for the target is given in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising approach for semi-supervised VOS. However, an important point has been overlooked in applying STM to VOS: The solution (=STM) is non-local, but the problem (=VOS) is predominantly local. To solve this mismatch between STM and VOS, we propose new VOS networks called kernelized memory network (KMN) and KMN with multiple kernels (KMN$^{M}$). Our networks conduct not only Query-to-Memory matching but also Memory-to-Query matching. In Memory-to-Query matching, a kernel is employed to reduce the degree of non-localness of the STM. In addition, we present a Hide-and-Seek strategy in pre-training to handle occlusions effectively. The proposed networks surpass the state-of-the-art results on standard benchmarks by a significant margin (+4% in $\mathcal{J_{M}}$ on DAVIS 2017 test-dev set). The runtimes of our proposed KMN and KMN$^{M}$ on DAVIS 2016 validation set are 0.12 and 0.13 seconds per frame, respectively, and the two networks have similar computation times to STM.
AB - Semi-supervised video object segmentation (VOS) is to predict the segment of a target object in a video when a ground truth segmentation mask for the target is given in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising approach for semi-supervised VOS. However, an important point has been overlooked in applying STM to VOS: The solution (=STM) is non-local, but the problem (=VOS) is predominantly local. To solve this mismatch between STM and VOS, we propose new VOS networks called kernelized memory network (KMN) and KMN with multiple kernels (KMN$^{M}$). Our networks conduct not only Query-to-Memory matching but also Memory-to-Query matching. In Memory-to-Query matching, a kernel is employed to reduce the degree of non-localness of the STM. In addition, we present a Hide-and-Seek strategy in pre-training to handle occlusions effectively. The proposed networks surpass the state-of-the-art results on standard benchmarks by a significant margin (+4% in $\mathcal{J_{M}}$ on DAVIS 2017 test-dev set). The runtimes of our proposed KMN and KMN$^{M}$ on DAVIS 2016 validation set are 0.12 and 0.13 seconds per frame, respectively, and the two networks have similar computation times to STM.
KW - gaussian kernel
KW - hide-and-seek
KW - memory network
KW - Video object segmentation
UR - http://www.scopus.com/inward/record.url?scp=85127521076&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2022.3163375
DO - 10.1109/TPAMI.2022.3163375
M3 - Article
C2 - 35353695
AN - SCOPUS:85127521076
SN - 0162-8828
VL - 45
SP - 2595
EP - 2612
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 2
ER -