Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings



Oct 25, 2018
Jee-weon Jung, Hee-soo Heo, Hye-jin Shim, Ha-jin Yu


Short input utterances are one of the most critical factors degrading the performance of speaker verification systems. This study aimed to develop an integrated text-independent speaker verification system that accepts short utterances of 2.05 seconds. To this end, we propose a teacher-student learning framework that maximizes the cosine similarity between two speaker embeddings, one extracted from a long utterance and one from a short utterance. In the proposed architecture, convolutional layers extract phonetic-level features, each representing a 130 ms segment, and gated recurrent units aggregate these features into an utterance-level speaker embedding. Experiments were conducted on the VoxCeleb 1 dataset using deep neural networks that take raw waveforms as input and output speaker embeddings. Without short utterance compensation, the equal error rates are 8.72 % and 12.8 % for evaluation sets with durations of 3.59 s and 2.05 s, respectively. With compensation, the proposed model achieves an equal error rate of 10.08 % on 2.05 s utterances, recovering more than 65 % of the performance degradation.
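The cosine-based objective and the "more than 65 %" figure can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`cosine_ts_loss`, `compensated_fraction`), the `1 - cos` form of the loss, the embedding size in the demo, and the assignment of the long utterance to the teacher and the short utterance to the student are all assumptions made for illustration. The abstract states only that the cosine similarity between the two embeddings is maximized.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    # eps guards against division by zero for all-zero embeddings.
    return dot / np.maximum(norms, eps)


def cosine_ts_loss(teacher_emb, student_emb):
    """Hypothetical teacher-student loss: mean of (1 - cosine similarity).

    Assumption: teacher_emb comes from long utterances and student_emb
    from short crops of the same utterances. Minimizing this value
    maximizes the cosine similarity between the paired embeddings.
    """
    return float(np.mean(1.0 - cosine_similarity(teacher_emb, student_emb)))


def compensated_fraction(eer_long, eer_short, eer_compensated):
    """Fraction of the long-to-short EER degradation that is recovered."""
    return (eer_short - eer_compensated) / (eer_short - eer_long)


# Demo with random stand-in embeddings (batch of 4, dimension 8).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8))  # slightly perturbed copy
print("loss:", cosine_ts_loss(teacher, student))

# EER values taken from the abstract: 8.72 %, 12.8 %, 10.08 %.
print("recovered:", compensated_fraction(8.72, 12.8, 10.08))
```

With the reported equal error rates, the degradation is 12.8 − 8.72 = 4.08 points and the recovered part is 12.8 − 10.08 = 2.72 points, so the recovered fraction is 2.72 / 4.08, or about 66.7 %, consistent with the "more than 65 %" claim.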

* 5 pages, 2 figures, submitted to ICASSP 2019 as a conference paper
