TY - JOUR
T1 - Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification
AU - Jung, Jee Weon
AU - Heo, Hee Soo
AU - Yang, Il Ho
AU - Shim, Hye Jin
AU - Yu, Ha Jin
N1 - Publisher Copyright:
© 2018 International Speech Communication Association. All rights reserved.
PY - 2018
Y1 - 2018
N2 - In this research, we propose novel raw waveform end-to-end DNNs for text-independent speaker verification. For speaker verification, many studies utilize the speaker embedding scheme, which trains deep neural networks as speaker identifiers to extract speaker features. However, this scheme has an intrinsic limitation in that the speaker feature, trained to classify only known speakers, is required to represent the identity of unknown speakers. Owing to this mismatch, speaker embedding systems tend to generalize well to unseen utterances from known speakers, but are overfitted to known speakers. This phenomenon is referred to as speaker overfitting. In this paper, we investigate regularization techniques, a multi-step training scheme, and a residual connection with pooling layers from the perspective of mitigating speaker overfitting, which leads to considerable performance improvements. The effectiveness of these techniques is evaluated using the VoxCeleb dataset, which comprises over 1,200 speakers recorded in various uncontrolled environments. To the best of our knowledge, we are the first to verify the success of end-to-end DNNs that directly use raw waveforms in a text-independent scenario. The proposed system shows an equal error rate of 7.4%, which is lower than that of i-vector/probabilistic linear discriminant analysis and end-to-end DNNs that use spectrograms.
AB - In this research, we propose novel raw waveform end-to-end DNNs for text-independent speaker verification. For speaker verification, many studies utilize the speaker embedding scheme, which trains deep neural networks as speaker identifiers to extract speaker features. However, this scheme has an intrinsic limitation in that the speaker feature, trained to classify only known speakers, is required to represent the identity of unknown speakers. Owing to this mismatch, speaker embedding systems tend to generalize well to unseen utterances from known speakers, but are overfitted to known speakers. This phenomenon is referred to as speaker overfitting. In this paper, we investigate regularization techniques, a multi-step training scheme, and a residual connection with pooling layers from the perspective of mitigating speaker overfitting, which leads to considerable performance improvements. The effectiveness of these techniques is evaluated using the VoxCeleb dataset, which comprises over 1,200 speakers recorded in various uncontrolled environments. To the best of our knowledge, we are the first to verify the success of end-to-end DNNs that directly use raw waveforms in a text-independent scenario. The proposed system shows an equal error rate of 7.4%, which is lower than that of i-vector/probabilistic linear discriminant analysis and end-to-end DNNs that use spectrograms.
KW - End-to-end
KW - Raw waveform
KW - Speaker embedding
KW - Speaker overfitting
KW - Speaker verification
UR - http://www.scopus.com/inward/record.url?scp=85054982820&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-1608
DO - 10.21437/Interspeech.2018-1608
M3 - Conference article
AN - SCOPUS:85054982820
SN - 1990-9772
VL - 2018-September
SP - 3583
EP - 3587
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018
Y2 - 2 September 2018 through 6 September 2018
ER -