TY - JOUR
T1 - FA-ExU-Net: The Simultaneous Training of an Embedding Extractor and Enhancement Model for a Speaker Verification System Robust to Short Noisy Utterances
AU - Kim, Ju Ho
AU - Heo, Jungwoo
AU - Shin, Hyun Seo
AU - Lim, Chan Yeong
AU - Yu, Ha Jin
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Speaker verification (SV) technology has the potential to enhance personalization and security in various applications, such as voice assistants, forensics, and access control. However, several challenges hinder the practical application of SV systems, including limitations and distortions of speaker information caused by short utterances and noisy environments. Furthermore, these two factors often coexist in real-world situations, resulting in significant performance degradation of SV systems. Despite the significance of these obstacles, each factor has been studied independently, and the co-occurrence of both factors has rarely been investigated. Here, we propose a novel SV framework, the feature-aggregated extended U-Net (FA-ExU-Net), which simultaneously addresses both challenges by building on the success of prior research on each factor. The FA-ExU-Net incorporates an iterative and hierarchical feature aggregation scheme, a target task-specific feature enhancement module, and a multi-scale feature aggregator for extracting information-rich embeddings. Our proposed system outperforms recent baseline models on four evaluation criteria: generalizability, short-utterance performance, capacity to handle noisy environments, and robustness to short utterances in noisy environments. We demonstrate the effectiveness of the proposed model through comparison and ablation experiments and intuitive visualizations. The proposed approach is expected to contribute to the development of more robust and accurate SV models for practical applications.
KW - feature aggregation
KW - feature enhancement
KW - joint training
KW - noisy environments
KW - short utterances
KW - Speaker verification
UR - http://www.scopus.com/inward/record.url?scp=85188909978&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2024.3381005
DO - 10.1109/TASLP.2024.3381005
M3 - Article
AN - SCOPUS:85188909978
SN - 2329-9290
VL - 32
SP - 2269
EP - 2282
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
ER -