Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tomi Kinnunen

a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

Mar 03, 2024
Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen, Nicholas Evans, Jean-Francois Bonastre, Itshak Lapidot

Figure 1 for a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

Figure 2 for a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

Figure 3 for a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

Figure 4 for a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

Spoofing detection is today a mainstream research topic. Standard metrics can be applied to evaluate the performance of isolated spoofing detection solutions and others have been proposed to support their evaluation when they are combined with speaker detection. These either have well-known deficiencies or restrict the architectural approach to combine speaker and spoof detectors. In this paper, we propose an architecture-agnostic detection cost function (a-DCF). A generalisation of the original DCF used widely for the assessment of automatic speaker verification (ASV), the a-DCF is designed for the evaluation of spoofing-robust ASV. Like the DCF, the a-DCF reflects the cost of decisions in a Bayes risk sense, with explicitly defined class priors and detection cost model. We demonstrate the merit of the a-DCF through the benchmarking evaluation of architecturally-heterogeneous spoofing-robust ASV solutions.

* 8 pages, submitted to Speaker Odyssey 2024

Via

Access Paper or Ask Questions

ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification

Feb 23, 2024
Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

The accuracy of modern automatic speaker verification (ASV) systems, when trained exclusively on adult data, drops substantially when applied to children's speech. The scarcity of children's speech corpora hinders fine-tuning ASV systems for children's speech. Hence, there is a timely need to explore more effective ways of reusing adults' speech data. One promising approach is to align vocal-tract parameters between adults and children through children-specific data augmentation, referred here to as ChildAugment. Specifically, we modify the formant frequencies and formant bandwidths of adult speech to emulate children's speech. The modified spectra are used to train ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time-delay neural network) recognizer for children. We compare ChildAugment against various state-of-the-art data augmentation techniques for children's ASV. We also extensively compare different scoring methods, including cosine scoring, PLDA (probabilistic linear discriminant analysis), and NPLDA (neural PLDA). We also propose a low-complexity weighted cosine score for extremely low-resource children ASV. Our findings on the CSLU kids corpus indicate that ChildAugment holds promise as a simple, acoustics-motivated approach, for improving state-of-the-art deep learning based ASV for children. We achieve up to 12.45% (boys) and 11.96% (girls) relative improvement over the baseline.

* The following article has been accepted by The Journal of the Acoustical Society of America (JASA). After it is published, it will be found at https://pubs.aip.org/asa/jasa

Via

Access Paper or Ask Questions

Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Jan 28, 2024
Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

It is now well-known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to counteract ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input either as a bonafide, or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization efforts at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort imposter (non-targets) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protections and more economic computations. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively.

* Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)

Via

Access Paper or Ask Questions

t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Sep 21, 2023
Tomi Kinnunen, Kong Aik Lee, Hemlata Tak, Nicholas Evans, Andreas Nautsch

Figure 1 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Figure 2 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Figure 3 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Figure 4 for t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliablity in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a \emph{set} of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the \emph{concurrent} t-EER, a unique operating point which is invariable to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.

* To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence. For associated codes, see https://github.com/TakHemlata/T-EER (Github) and https://colab.research.google.com/drive/1ga7eiKFP11wOFMuZjThLJlkBcwEG6_4m?usp=sharing (Google Colab)

Via

Access Paper or Ask Questions

Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

Jun 13, 2023
Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

Figure 1 for Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

Figure 2 for Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

Figure 3 for Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

Figure 4 for Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

In this paper, we study the impact of the ageing on modern deep speaker embedding based automatic speaker verification (ASV) systems. We have selected two different datasets to examine ageing on the state-of-the-art ECAPA-TDNN system. The first dataset, used for addressing short-term ageing (up to 10 years time difference between enrollment and test) under uncontrolled conditions, is VoxCeleb. The second dataset, used for addressing long-term ageing effect (up to 40 years difference) of Finnish speakers under a more controlled setup, is Longitudinal Corpus of Finnish Spoken in Helsinki (LCFSH). Our study provides new insights into the impact of speaker ageing on modern ASV systems. Specifically, we establish a quantitative measure between ageing and ASV scores. Further, our research indicates that ageing affects female English speakers to a greater degree than male English speakers, while in the case of Finnish, it has a greater impact on male speakers than female speakers.

* Interspeech 2023

Via

Access Paper or Ask Questions

Towards single integrated spoofing-aware speaker verification embeddings

Jun 01, 2023
Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung

Figure 1 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 2 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 3 for Towards single integrated spoofing-aware speaker verification embeddings

Figure 4 for Towards single integrated spoofing-aware speaker verification embeddings

This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. We analyze that the inferior performance of single SASV embeddings comes from insufficient amount of training data and distinct nature of ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.

* Accepted by INTERSPEECH 2023. Code and models are available in https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

Via

Access Paper or Ask Questions

Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

Jun 01, 2023
Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen

Figure 1 for Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

Figure 2 for Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

Figure 3 for Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

Figure 4 for Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

Audio anti-spoofing for automatic speaker verification aims to safeguard users' identities from spoofing attacks. Although state-of-the-art spoofing countermeasure(CM) models perform well on specific datasets, they lack generalization when evaluated with different datasets. To address this limitation, previous studies have explored large pre-trained models, which require significant resources and time. We aim to develop a compact but well-generalizing CM model that can compete with large pre-trained models. Our approach involves multi-dataset co-training and sharpness-aware minimization, which has not been investigated in this domain. Extensive experiments reveal that proposed method yield competitive results across various datasets while utilizing 4,000 times less parameters than the large pre-trained models.

* Interspeech 2023

Via

Access Paper or Ask Questions

How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

May 31, 2023
Hye-jin Shim, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

Figure 1 for How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

Figure 2 for How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

Figure 3 for How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

Figure 4 for How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

Shortcut learning, or `Clever Hans effect` refers to situations where a learning agent (e.g., deep neural networks) learns spurious correlations present in data, resulting in biased models. We focus on finding shortcuts in deep learning based spoofing countermeasures (CMs) that predict whether a given utterance is spoofed or not. While prior work has addressed specific data artifacts, such as silence, no general normative framework has been explored for analyzing shortcut learning in CMs. In this study, we propose a generic approach to identifying shortcuts by introducing systematic interventions on the training and test sides, including the boundary cases of `near-perfect` and `worse than coin flip` (label flip). By using three different models, ranging from classic to state-of-the-art, we demonstrate the presence of shortcut learning in five simulated conditions. We analyze the results using a regression model to understand how biases affect the class-conditional score statistics.

* Interspeech 2023

Via

Access Paper or Ask Questions

Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

Mar 02, 2023
Xuechen Liu, Md Sahidullah, Tomi Kinnunen

Figure 1 for Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

Figure 2 for Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

Figure 3 for Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

Figure 4 for Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

Deep speaker models yield low error rates in speaker verification. Nonetheless, the high performance tends to be exchanged for model size and computation time, making these models challenging to run under limited conditions. We focus on small-footprint deep speaker embedding extraction, leveraging knowledge distillation. While prior work on this topic has addressed speaker embedding extraction at the utterance level, we propose to combine embeddings from various levels of the x-vector model (teacher network) to train small-footprint student networks. Results indicate the usefulness of frame-level information, with the student models being 85%-91% smaller than their teacher, depending on the size of the teacher embeddings. Concatenation of teacher embeddings results in student networks that reach comparable performance along with the teacher while utilizing a 75% relative size reduction from the teacher. The findings and analogies are furthered to other x-vector variants.

Via

Access Paper or Ask Questions

Speaker-Aware Anti-Spoofing

Mar 02, 2023
Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

Figure 1 for Speaker-Aware Anti-Spoofing

Figure 2 for Speaker-Aware Anti-Spoofing

Figure 3 for Speaker-Aware Anti-Spoofing

Figure 4 for Speaker-Aware Anti-Spoofing

We address speaker-aware anti-spoofing, where prior knowledge of the target speaker is incorporated into a voice spoofing countermeasure (CM). In contrast to the frequently used speaker-independent solutions, we train the CM in a speaker-conditioned way. As a proof of concept, we consider speaker-aware extension to the state-of-the-art AASIST (audio anti-spoofing using integrated spectro-temporal graph attention networks) model. To this end, we consider two alternative strategies to incorporate target speaker information at the frame and utterance levels, respectively. The experimental results on a custom protocol based on ASVspoof 2019 dataset indicates the efficiency of the speaker information via enrollment: we obtain maximum relative improvements of 25.1% and 11.6% in equal error rate (EER) and minimum tandem detection cost function (t-DCF) over a speaker-independent baseline, respectively.

Via

Access Paper or Ask Questions