TY - JOUR
T1 - Improving Noise Robustness in Self-supervised Pre-trained Model for Speaker Verification
AU - Lim, Chan Yeong
AU - Shin, Hyun Seo
AU - Kim, Ju Ho
AU - Heo, Jungwoo
AU - Koo, Kyo Won
AU - Kim, Seung Bin
AU - Yu, Ha Jin
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Adopting self-supervised pre-trained models (PMs) in speaker verification (SV) has shown remarkable performance, but their noise robustness is largely unexplored. In the field of automatic speech recognition, additional training strategies enhance the robustness of the models before fine-tuning to improve performance in noisy environments. However, directly applying these strategies to SV risks distorting speaker information. We propose a noise adaptive warm-up training for speaker verification (NAW-SV). The NAW-SV guides the PM to extract consistent representations in noisy conditions using teacher-student learning. In this approach, to prevent the speaker information distortion problem, we introduce a novel loss function called extended angular prototypical network loss, which assists in considering speaker information and exploring robust speaker embedding space. We validated our proposed framework on the noise-synthesized VoxCeleb1 test set, demonstrating promising robustness.
AB - Adopting self-supervised pre-trained models (PMs) in speaker verification (SV) has shown remarkable performance, but their noise robustness is largely unexplored. In the field of automatic speech recognition, additional training strategies enhance the robustness of the models before fine-tuning to improve performance in noisy environments. However, directly applying these strategies to SV risks distorting speaker information. We propose a noise adaptive warm-up training for speaker verification (NAW-SV). The NAW-SV guides the PM to extract consistent representations in noisy conditions using teacher-student learning. In this approach, to prevent the speaker information distortion problem, we introduce a novel loss function called extended angular prototypical network loss, which assists in considering speaker information and exploring robust speaker embedding space. We validated our proposed framework on the noise-synthesized VoxCeleb1 test set, demonstrating promising robustness.
KW - noisy environments
KW - self-supervised learning
KW - speaker verification
KW - teacher-student learning
UR - https://www.scopus.com/pages/publications/85214835126
U2 - 10.21437/Interspeech.2024-1630
DO - 10.21437/Interspeech.2024-1630
M3 - Conference article
AN - SCOPUS:85214835126
SN - 1990-9772
SP - 2665
EP - 2669
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 25th Interspeech Conferece 2024
Y2 - 1 September 2024 through 5 September 2024
ER -