FA-ExU-Net: The Simultaneous Training of an Embedding Extractor and Enhancement Model for a Speaker Verification System Robust to Short Noisy Utterances

Ju Ho Kim, Jungwoo Heo, Hyun Seo Shin, Chan Yeong Lim, Ha Jin Yu

Research output: Contribution to journal › Article › peer-review

Abstract

Speaker verification (SV) technology has the potential to enhance personalization and security in various applications, such as voice assistants, forensics, and access control. However, several challenges hinder the practical application of SV systems, including limitations and distortions in speaker information due to short utterances and noisy environments. Furthermore, these two factors often coexist in real-world situations, resulting in significant performance degradation of SV systems. Despite the significance of these obstacles, each factor has been studied independently, and their co-occurrence is rarely investigated. Here, we propose a novel SV framework, feature aggregated extended U-Net (FA-ExU-Net), which addresses both challenges simultaneously by building on the success of prior research on each factor. The FA-ExU-Net incorporates an iterative and hierarchical feature aggregation scheme, a target task-specific feature enhancement module, and a multi-scale feature aggregator for extracting information-rich embeddings. Our proposed system outperforms recent baseline models on four evaluation criteria: generalizability, short utterance performance, capacity to handle noisy environments, and robustness to short utterances in noisy environments. We demonstrate the effectiveness of the proposed model through comparison experiments, ablation experiments, and intuitive visualizations. The proposed approach is expected to contribute to the development of more robust and accurate SV models for practical applications.

Original language: English
Pages (from-to): 2269-2282
Number of pages: 14
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 32
State: Published - 2024

Keywords

  • feature aggregation
  • feature enhancement
  • joint training
  • noisy environments
  • short-utterances
  • Speaker verification
