A Speaker Verification System Based on a Modified MLP-Mixer Student Model for Transformer Compression

  • Jungwoo Heo
  • Hyun Seo Shin
  • Chan Yeong Lim
  • Kyo Won Koo
  • Seung Bin Kim
  • Jisoo Son
  • Ha Jin Yu

Research output: Contribution to journal › Article › peer-review

Abstract

Speaker verification (SV) systems have recently achieved remarkable progress through self-supervised learning (SSL) models such as Wav2Vec2 and WavLM. Despite their strong performance, these Transformer-based models remain computationally intensive, limiting their applicability in mobile or real-time applications. Knowledge distillation has been employed to compress large SSL teachers into lightweight student models, yet most existing approaches retain Transformer architectures, thereby preserving the quadratic cost of attention mechanisms. In this work, we investigate MLP-Mixer–based student models as an alternative to Transformer students for SSL knowledge distillation. We propose a speech-oriented redesign of the MLP-Mixer that incorporates 1D convolution in the token-mixing stage to capture local temporal dependencies, and a Max-Feature-Map operation in the channel-mixing stage to emphasize speaker-discriminative frequency bands. Distilling knowledge from a WavLM teacher to this modified MLP-Mixer addresses the challenges of cross-architecture distillation. Extensive experiments on diverse benchmarks, including VoxCeleb, VCMix, VoxSRC, and VOiCES, demonstrate that the proposed approach achieves superior verification performance and computational efficiency compared to Transformer-based students, highlighting its potential for practical deployment.
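The two modifications described above (1D convolution for token mixing, Max-Feature-Map for channel mixing) can be illustrated with a minimal NumPy sketch. This is a hypothetical toy implementation for intuition only, not the authors' code: the kernel size, hidden width, depthwise-style convolution, and single-block structure are all assumptions.

```python
import numpy as np

def max_feature_map(x):
    # Max-Feature-Map: split the channel dimension in half and take
    # the elementwise max, keeping only the stronger activation.
    h = x.shape[-1] // 2
    return np.maximum(x[..., :h], x[..., h:])

def conv1d_token_mix(x, kernel):
    # Local token mixing: a shared 1D convolution along the time axis
    # ('same' padding), replacing the global token-mixing MLP.
    T, C = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        out[t] = np.tensordot(kernel, xp[t:t + k], axes=(0, 0))
    return out

def mixer_block(x, kernel, W1, W2):
    # Token-mixing stage with a 1D conv to capture local temporal context.
    x = x + conv1d_token_mix(x, kernel)
    # Channel-mixing stage with MFM in place of the usual GELU;
    # MFM halves the hidden width, so W2 maps H // 2 back to C.
    h = max_feature_map(x @ W1)
    return x + h @ W2

rng = np.random.default_rng(0)
T, C, H = 10, 16, 32                     # frames, channels, hidden width (even for MFM)
x = rng.standard_normal((T, C))
kernel = rng.standard_normal(5)          # assumed kernel size of 5
W1 = rng.standard_normal((C, H))
W2 = rng.standard_normal((H // 2, C))
y = mixer_block(x, kernel, W1, W2)
print(y.shape)                           # same shape as the input: (10, 16)
```

Note how MFM acts as a competitive selector between paired feature maps, which is the property the abstract leverages to emphasize speaker-discriminative frequency bands.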

Original language: English
Pages (from-to): 190371-190378
Number of pages: 8
Journal: IEEE Access
Volume: 13
DOIs
State: Published - 2025

Keywords

  • MLP-mixer
  • Speaker verification
  • knowledge distillation
  • transformers

