Abstract
Variation in utterance length is a representative factor that degrades the performance of speaker verification systems. To handle this issue, previous studies attempted to extract speaker features from multiple branches or to use convolution layers with different receptive fields. Combining the advantages of these two approaches for variable-length input, this paper proposes integrated receptive field diversification, which extracts speaker features through more diverse receptive fields. The proposed method processes the input features with convolutional layers of different receptive fields at multiple time-axis branches, and extracts a speaker embedding by dynamically aggregating the processed features according to the length of the input utterance. The deep neural networks in this study were trained on the VoxCeleb2 dataset and tested on the VoxCeleb1 evaluation dataset, divided into 1 s, 2 s, 5 s, and full-length conditions. Experimental results demonstrated that the proposed method reduces the equal error rate by 19.7 % compared to the baseline.
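The abstract describes two combined ideas: parallel convolution branches with different receptive fields along the time axis, and an aggregation step conditioned on utterance length. The sketch below illustrates that combination in PyTorch; it is a minimal reading of the abstract, not the authors' implementation, and every name in it (`DiverseReceptiveFieldBlock`, `length_gate`, the kernel sizes) is a hypothetical choice for illustration.

```python
# Minimal sketch of the idea in the abstract: parallel 1-D convolution
# branches with different receptive fields over the time axis, aggregated
# with weights conditioned on the utterance length. All names and
# hyperparameters here are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn


class DiverseReceptiveFieldBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int,
                 kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per kernel size -> one receptive field per branch.
        self.branches = nn.ModuleList(
            nn.Conv1d(in_channels, out_channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        # Maps the (log) input length to one softmax weight per branch,
        # so aggregation adapts dynamically to the utterance length.
        self.length_gate = nn.Sequential(
            nn.Linear(1, len(kernel_sizes)),
            nn.Softmax(dim=-1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); time varies with utterance length.
        length = torch.full((x.size(0), 1), float(x.size(-1)),
                            device=x.device).log()
        weights = self.length_gate(length)            # (batch, n_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=1)
        # Length-weighted sum over branches -> (batch, channels, time).
        return (weights[:, :, None, None] * outs).sum(dim=1)


if __name__ == "__main__":
    block = DiverseReceptiveFieldBlock(in_channels=80, out_channels=128)
    for frames in (100, 200, 500):        # e.g., roughly 1 s, 2 s, 5 s
        feats = torch.randn(4, 80, frames)  # (batch, mel bins, frames)
        print(block(feats).shape)           # (4, 128, frames)
```

In a full system, a block like this would feed a temporal pooling layer that produces the fixed-dimensional speaker embedding; the paper's actual branch structure and aggregation rule may differ from this sketch.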
| Original language | English |
|---|---|
| Pages (from-to) | 319-325 |
| Number of pages | 7 |
| Journal | Journal of the Acoustical Society of Korea |
| Volume | 41 |
| Issue number | 3 |
| DOIs | |
| State | Published - 2022 |
Keywords
- Deep neural network
- Receptive field
- Speaker verification
- Variable-length utterance