TY - GEN
T1 - HM-CONFORMER
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
AU - Shin, Hyun Seo
AU - Heo, Jungwoo
AU - Kim, Ju Ho
AU - Lim, Chan Yeong
AU - Kim, Wonbin
AU - Yu, Ha Jin
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. To capture these, the Conformer, which consists of Transformers and CNN, possesses a suitable structure. However, since the Conformer was designed for sequence-to-sequence tasks, its direct application to ADD tasks may be sub-optimal. To tackle this limitation, we propose HM-Conformer by adopting two components: (1) Hierarchical pooling method progressively reducing the sequence length to eliminate duplicated information (2) Multi-level classification token aggregation method utilizing classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experimental results on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.
AB - Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. To capture these, the Conformer, which consists of Transformers and CNN, possesses a suitable structure. However, since the Conformer was designed for sequence-to-sequence tasks, its direct application to ADD tasks may be sub-optimal. To tackle this limitation, we propose HM-Conformer by adopting two components: (1) Hierarchical pooling method progressively reducing the sequence length to eliminate duplicated information (2) Multi-level classification token aggregation method utilizing classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experimental results on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.
KW - Anti-spoofing
KW - Audio deepfake detection
KW - Conformer
KW - Hierarchical pooling
KW - Multi-level classification token aggregation
UR - http://www.scopus.com/inward/record.url?scp=85198551761&partnerID=8YFLogxK
U2 - 10.1109/ICASSP48485.2024.10448453
DO - 10.1109/ICASSP48485.2024.10448453
M3 - Conference contribution
AN - SCOPUS:85198551761
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 10581
EP - 10585
BT - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 April 2024 through 19 April 2024
ER -