Which to select? Analysis of speaker representation with graph attention networks

Hye Jin Shim, Jee Weon Jung, Ha Jin Yu

Research output: Contribution to journal › Article › peer-review

Abstract

Although recent state-of-the-art systems show almost perfect performance, the speaker embeddings they produce have received little analysis thus far. This work performs an in-depth analysis of speaker representation by examining which features the model selects. To this end, various intermediate representations of the trained model are observed using graph attentive feature aggregation, which comprises a graph attention layer and a graph pooling layer followed by a readout operation. The TIMIT dataset, which has comparatively restricted conditions (e.g., region and phoneme), is used after pre-training the model on the VoxCeleb dataset and freezing its weight parameters. Extensive experiments reveal a consistent trend in speaker representation: the models learn to exploit sequence and phoneme information despite receiving no supervision in that direction. These results shed light on speaker embeddings, which are still largely regarded as a black box.
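The abstract describes the analysis tool only at a high level (a graph attention layer, a graph pooling layer, and a readout). The sketch below is a minimal, hedged illustration of that kind of graph attentive feature aggregation, not the authors' implementation: frame-level intermediate features are treated as nodes of a fully connected graph, updated with GAT-style attention, reduced by top-k graph pooling, and summarized by a mean readout. The class name, layer sizes, and pooling ratio are illustrative assumptions.

```python
# Minimal sketch of graph attentive feature aggregation (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentiveAggregation(nn.Module):
    def __init__(self, feat_dim: int = 256, pool_ratio: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)
        # GAT-style attention score over concatenated node pairs.
        self.att = nn.Linear(2 * feat_dim, 1)
        # Node scoring used by the top-k graph pooling step.
        self.score = nn.Linear(feat_dim, 1)
        self.pool_ratio = pool_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, nodes, feat_dim) intermediate frame-level representations.
        h = self.proj(x)
        n = h.size(1)
        # Pairwise attention logits on a fully connected graph.
        hi = h.unsqueeze(2).expand(-1, -1, n, -1)
        hj = h.unsqueeze(1).expand(-1, n, -1, -1)
        e = F.leaky_relu(self.att(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        a = torch.softmax(e, dim=-1)                 # (batch, nodes, nodes)
        h = torch.bmm(a, h)                          # attention-weighted node update
        # Graph pooling: keep the top-k scored nodes ("which to select").
        k = max(1, int(n * self.pool_ratio))
        s = self.score(h).squeeze(-1)                # (batch, nodes)
        idx = s.topk(k, dim=-1).indices
        h = torch.gather(h, 1, idx.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        # Readout: average the retained nodes into one utterance-level vector.
        return h.mean(dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)                 # e.g., 100 frames, 256-dim
    print(GraphAttentiveAggregation()(feats).shape)  # torch.Size([2, 256])
```

Inspecting which nodes survive the top-k selection (the `idx` tensor) is what would let one relate the retained frames back to properties such as phoneme identity or position in the sequence, in the spirit of the analysis the abstract describes.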

Original language: English
Pages (from-to): 2701-2708
Number of pages: 8
Journal: Journal of the Acoustical Society of America
Volume: 156
Issue number: 4
DOIs
State: Published - 1 Oct 2024
