Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification

Jee Weon Jung, Hee Soo Heo, Il Ho Yang, Hye Jin Shim, Ha Jin Yu

Research output: Contribution to journal › Conference article › peer-review

23 Scopus citations


In this research, we propose novel end-to-end DNNs that operate directly on raw waveforms for text-independent speaker verification. Many speaker verification studies use the speaker embedding scheme, which trains deep neural networks as speaker identifiers in order to extract speaker features. However, this scheme has an intrinsic limitation: the speaker feature, trained to classify only known speakers, is required to represent the identity of unknown speakers. Owing to this mismatch, speaker embedding systems tend to generalize well to unseen utterances from known speakers but overfit to those known speakers. This phenomenon is referred to as speaker overfitting. In this paper, we investigate regularization techniques, a multi-step training scheme, and a residual connection with pooling layers from the perspective of mitigating speaker overfitting, all of which led to considerable performance improvements. The effectiveness of these techniques is evaluated on the VoxCeleb dataset, which comprises over 1,200 speakers recorded in various uncontrolled environments. To the best of our knowledge, we are the first to verify the success of end-to-end DNNs that directly use raw waveforms in a text-independent scenario. The proposed system achieves an equal error rate of 7.4%, which is lower than that of i-vector/probabilistic linear discriminant analysis systems and of end-to-end DNNs that use spectrograms.
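The equal error rate (EER) quoted in the abstract is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of one common way to approximate it from verification scores, not the authors' evaluation code. The function name `equal_error_rate`, the convention that higher scores mean "same speaker", and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from two sets of verification scores.

    Assumption: a higher score means the system is more confident that
    the two utterances come from the same speaker.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above it.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


# Hypothetical toy scores, for illustration only.
same_speaker = [0.9, 0.8, 0.7, 0.3]
different_speaker = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(same_speaker, different_speaker)
print(f"EER = {eer:.2%}")
```

With a finite trial list the two error curves rarely cross exactly, so this sketch reports the average of the two rates at the closest threshold. Evaluation toolkits may interpolate between thresholds instead and can give slightly different values.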

Original language: English
Pages (from-to): 3583-3587
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2 Sep 2018 → 6 Sep 2018


  • End-to-end
  • Raw waveform
  • Speaker embedding
  • Speaker overfitting
  • Speaker verification

