A simple distortion-free method to handle variable length sequences for recurrent neural networks in text dependent speaker verification

Sung Hyun Yoon, Ha Jin Yu

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

Recurrent neural networks (RNNs) can model the time dependency of time-series data and have been widely used in text-dependent speaker verification to extract speaker- and phrase-discriminant embeddings. As with other neural networks, RNNs are trained in mini-batch units. To feed input sequences into an RNN in mini-batch units, all the sequences in a mini-batch must have the same length. However, the sequences have variable lengths, and these lengths are not known in advance. Truncation and padding are most commonly used to make all sequences the same length, but they distort the information in the sequences: truncation discards some information and padding adds unnecessary information, which can degrade the performance of text-dependent speaker verification. In this paper, we propose a method to handle variable-length sequences for RNNs without introducing such distortion, by truncating each output sequence so that it has the same length as the corresponding original input sequence. The experimental results for the text-dependent speaker verification task in Part 2 of RSR2015 show that our method reduces the relative equal error rate by approximately 1.3% to 27.1%, depending on the task, compared to the baselines, with only a small associated overhead in execution time.
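The following is a minimal sketch of the general idea described in the abstract, not the authors' implementation: pad each mini-batch to the length of its longest sequence, run the RNN, then cut every output sequence back to its original input length so padded frames never contribute to later processing. PyTorch, the GRU layer, the feature dimension, and the helper function name are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def rnn_forward_variable_length(rnn, sequences):
    """Run an RNN over variable-length sequences without distorting them.

    Pads every sequence in the mini-batch to the length of the longest one,
    feeds the padded batch through the RNN, then truncates each output
    sequence back to its original length so no padded frames remain.
    (Hypothetical helper, written to illustrate the abstract's idea.)
    """
    lengths = [seq.size(0) for seq in sequences]                       # original frame counts
    padded = nn.utils.rnn.pad_sequence(sequences, batch_first=True)    # (B, T_max, D)
    outputs, _ = rnn(padded)                                           # (B, T_max, H)
    # Keep only the frames corresponding to real input; discard padding.
    return [outputs[i, :lengths[i]] for i in range(len(sequences))]

# Usage example with assumed dimensions: 40-dim features, three utterances
# of different lengths, and a single-layer GRU with 64 hidden units.
rnn = nn.GRU(input_size=40, hidden_size=64, batch_first=True)
batch = [torch.randn(120, 40), torch.randn(87, 40), torch.randn(150, 40)]
trimmed_outputs = rnn_forward_variable_length(rnn, batch)
print([o.shape for o in trimmed_outputs])  # lengths 120, 87, 150 preserved
```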

Original language: English
Article number: 4092
Pages (from-to): 1-14
Number of pages: 14
Journal: Applied Sciences (Switzerland)
Volume: 10
Issue number: 12
DOIs
State: Published - 1 Jun 2020

Keywords

  • Padding
  • Recurrent neural network
  • Text-dependent speaker verification
  • Truncation
  • Variable length
