Abstract
Recurrent neural networks (RNNs) can model the time dependency of time-series data. They have also been widely used in text-dependent speaker verification to extract speaker-and-phrase-discriminant embeddings. As with other neural networks, RNNs are trained in mini-batch units. To feed input sequences into an RNN in mini-batch units, all the sequences in each mini-batch must have the same length. However, the sequences have variable lengths, and these lengths are not known in advance. Truncation and padding are most commonly used to make all sequences the same length. However, truncation and padding distort information: truncation discards part of a sequence and padding adds unnecessary content, which can degrade the performance of text-dependent speaker verification. In this paper, we propose a method to handle variable-length sequences in RNNs without adding information distortion, by truncating the output sequence so that it has the same length as the corresponding original input sequence. The experimental results for the text-dependent speaker verification task in part 2 of RSR2015 show that our method reduces the relative equal error rate by approximately 1.3% to 27.1%, depending on the task, compared to the baselines, but with an associated, small overhead in execution time.
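The core idea in the abstract can be illustrated with a minimal NumPy sketch, which is not the authors' implementation: sequences are zero-padded to a common length for mini-batch processing, a toy RNN runs over the padded batch, and each output sequence is then truncated back to its original input length so the padding frames contribute nothing downstream. The helper names, the vanilla tanh RNN, and the fixed weights below are illustrative assumptions.

```python
import numpy as np

def pad_batch(seqs, feat_dim):
    # Zero-pad variable-length sequences to the max length in the mini-batch.
    max_len = max(len(s) for s in seqs)
    batch = np.zeros((len(seqs), max_len, feat_dim))
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
    return batch

def simple_rnn(batch, W_x, W_h):
    # Vanilla tanh RNN applied over a padded (batch, time, feat) tensor.
    B, T, _ = batch.shape
    h = np.zeros((B, W_h.shape[0]))
    outputs = []
    for t in range(T):
        h = np.tanh(batch[:, t] @ W_x + h @ W_h)
        outputs.append(h)
    return np.stack(outputs, axis=1)  # shape: (batch, time, hidden)

def truncate_outputs(outputs, lengths):
    # Discard outputs produced from padding frames, so each output
    # sequence has the same length as its original input sequence.
    return [outputs[i, :L] for i, L in enumerate(lengths)]

# Usage: two sequences of lengths 3 and 5 with 4-dimensional features.
seqs = [np.ones((3, 4)), np.ones((5, 4))]
W_x = np.full((4, 2), 0.1)  # toy input-to-hidden weights
W_h = np.full((2, 2), 0.1)  # toy hidden-to-hidden weights
padded = pad_batch(seqs, feat_dim=4)            # (2, 5, 4)
outs = simple_rnn(padded, W_x, W_h)             # (2, 5, 2)
trimmed = truncate_outputs(outs, [3, 5])        # lengths 3 and 5 restored
```

In this sketch the padding frames are still processed by the RNN (the small execution-time overhead mentioned in the abstract), but their outputs are dropped, so no padded information leaks into the resulting embeddings.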
| Original language | English |
|---|---|
| Article number | 4092 |
| Pages (from-to) | 1-14 |
| Number of pages | 14 |
| Journal | Applied Sciences (Switzerland) |
| Volume | 10 |
| Issue number | 12 |
| DOIs | |
| State | Published - 1 Jun 2020 |
Keywords
- Padding
- Recurrent neural network
- Text-dependent speaker verification
- Truncation
- Variable length