Abstract
Speaker verification (SV) technology has the potential to enhance personalization and security in various applications, such as voice assistants, forensics, and access control. However, several challenges hinder the practical application of SV systems, notably the limited speaker information available in short utterances and the distortion of that information in noisy environments. These two factors often coexist in real-world situations, significantly degrading the performance of SV systems. Despite the significance of these obstacles, each factor has typically been studied independently, and their co-occurrence is rarely investigated. Here, we propose a novel SV framework, feature aggregated extended U-Net (FA-ExU-Net), which addresses both challenges simultaneously by building on the success of prior research on each factor. The FA-ExU-Net incorporates an iterative and hierarchical feature aggregation scheme, a target task-specific feature enhancement module, and a multi-scale feature aggregator for extracting information-rich embeddings. Our proposed system outperforms recent baseline models on four evaluation criteria: generalizability, short utterance performance, capacity to handle noisy environments, and robustness to short utterances in noisy environments. We demonstrate the effectiveness of the proposed model through comparison and ablation experiments and intuitive visualizations. The proposed approach is expected to contribute to the development of more robust and accurate SV models for practical applications.
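To make the "short utterances in noisy environments" condition concrete, the following is a minimal sketch of how such a test input could be simulated: an utterance is cropped to a short duration and noise is mixed in at a chosen signal-to-noise ratio (SNR). It is not taken from the paper, whose data preparation is not described in the abstract. The function names (`crop`, `power`, `add_noise`) are hypothetical, and the 16 kHz sample rate, 1-second duration, and 5 dB SNR are illustrative values only.

```python
import math
import random


def crop(signal, sample_rate, seconds):
    """Keep only the first `seconds` of the signal to mimic a short utterance."""
    return signal[: int(sample_rate * seconds)]


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def add_noise(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (in dB), then add it.

    Assumes `noise` is at least as long as `signal`.
    """
    noise = noise[: len(signal)]
    # SNR_dB = 10 * log10(P_signal / P_noise), so P_noise = P_signal / 10^(SNR_dB / 10)
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(signal, noise)]


# Illustrative values (assumptions, not from the paper).
rng = random.Random(0)
sample_rate = 16000
# A 3-second synthetic tone stands in for a clean utterance.
clean = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate * 3)]
noise = [rng.gauss(0.0, 1.0) for _ in range(len(clean))]

short = crop(clean, sample_rate, 1.0)        # short utterance: 1 second
noisy = add_noise(short, noise, snr_db=5.0)  # noisy short utterance at 5 dB SNR

# Verify the achieved SNR by recovering the added noise component.
residual = [n - s for n, s in zip(noisy, short)]
achieved_snr_db = 10 * math.log10(power(short) / power(residual))
```

Here `achieved_snr_db` should match the requested SNR up to floating-point error. In an evaluation along the lines the abstract describes, inputs like `noisy` would be passed to the SV system under test in place of full-length clean utterances.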
| Original language | English |
|---|---|
| Pages (from-to) | 2269-2282 |
| Number of pages | 14 |
| Journal | IEEE/ACM Transactions on Audio, Speech, and Language Processing |
| Volume | 32 |
| DOIs | |
| State | Published - 2024 |
Keywords
- feature aggregation
- feature enhancement
- joint training
- noisy environments
- short-utterances
- Speaker verification
Title
FA-ExU-Net: The Simultaneous Training of an Embedding Extractor and Enhancement Model for a Speaker Verification System Robust to Short Noisy Utterances