Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

eess.AS cs.CL cs.SD Xi Xuan, Wenxin Zhang, Zhiyu Li, Jennifer Williams, Ville Hautamäki, Tomi H. Kinnunen · Mar 23, 2026
AI Review
Plain-language introduction

This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.

Critical review
Verdict
Bottom line

The paper presents a novel and theoretically motivated approach to an understudied problem: speaker-source entanglement in deepfake verification. The two proposed losses, one built on Chebyshev polynomials ($\mathcal{F}_{\text{cheb}}$) and one on hyperbolic projections ($\tilde{f}_{i}^{\text{src}}$ onto the Poincaré ball), show consistent empirical gains across four encoder architectures and four evaluation protocols. However, the core disentanglement claim rests primarily on t-SNE visualizations rather than on cross-task tests of the proposed methods, and several citations are dated 2026 and may refer to unpublished work, which makes the theoretical grounding harder to verify.

“The pilot experiment... demonstrates that this is not the case, indicating that the embeddings learned for each task still retain substantial information about the other task”
paper · Section 3
“RiemannSD-AAM yields the best overall results... EER(%) 3.27 (±0.10) for ResNet34 average”
paper · Table 2
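For readers unfamiliar with the hyperbolic projection mentioned above, the following is a minimal sketch of the textbook Poincaré ball operations: the exponential map at the origin and the geodesic distance for curvature $-c$. It is not the paper's RiemannSD-AAM loss, which the review does not reproduce. The function names `expmap0` and `poincare_distance` are hypothetical, and treating the reported $c=6$ as the curvature parameter in this convention is an assumption.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean vector into the Poincare ball of curvature -c.

    Uses the standard exponential map at the origin:
    exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball.

    d(x, y) = arcosh(1 + 2c||x - y||^2 / ((1 - c||x||^2)(1 - c||y||^2))) / sqrt(c).
    Practical implementations usually clip norms away from the ball
    boundary for numerical stability; that step is omitted here.
    """
    def sq_norm(a):
        return sum(t * t for t in a)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# Toy 2-D "embeddings" (illustrative values, not from the paper).
src_a = expmap0([0.3, 0.4], c=6.0)
src_b = expmap0([-0.2, 0.1], c=6.0)
print(poincare_distance(src_a, src_b, c=6.0))
```

In this convention the distance from the origin to `expmap0(v)` is exactly twice the Euclidean norm of `v`, which is a convenient sanity check on any implementation.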
What holds up

The experimental design is robust, testing across ECAPA-TDNN, ResNet34, AASIST, and Mamba encoders with consistent improvements using both ChebySD-AAM and RiemannSD-AAM losses. The ablation study (Table 4) clearly demonstrates that removing speaker disentanglement ($\times$) degrades performance on unseen sources, confirming the necessity of the approach. The t-SNE visualization in Figure 2 provides compelling qualitative evidence that RiemannSD-AAM separates source clusters while dispersing speaker identity clusters.

“removing speaker disentanglement leads to notable performance degradation, confirming its necessity in the challenging unseen source scenarios”
paper · Section 5.2
“the proposed speaker-disentangled loss functions consistently outperform the baseline (AAM-Softmax), regardless of the synthetic source encoder employed”
paper · Section 5.1
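As background for the ChebySD-AAM loss named above, this sketch evaluates Chebyshev polynomials of the first kind with the standard three-term recurrence. The review does not describe how the paper builds its loss from these polynomials, so the code illustrates only the polynomial family itself; `chebyshev_t` is a hypothetical helper name.

```python
import math


def chebyshev_t(k, x):
    """Evaluate T_k(x) via T_0 = 1, T_1 = x, T_{n+1} = 2x*T_n - T_{n-1}."""
    if k == 0:
        return 1.0
    prev, curr = 1.0, x
    for _ in range(k - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr


# On [-1, 1] the polynomials satisfy T_k(cos(theta)) = cos(k * theta),
# so their values stay bounded in [-1, 1] for cosine-valued inputs.
theta = 0.7
print(chebyshev_t(5, math.cos(theta)), math.cos(5 * theta))
```

The boundedness on $[-1, 1]$ is one plausible reason such polynomials pair naturally with cosine similarities, though whether that is the mechanism behind the reported gradient stability is not something the quoted material establishes.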
Main concerns

The pilot cross-task evaluation (Table 1) effectively demonstrates entanglement in baseline models, yet the paper does not report the same cross-task measurement (speaker verification EER computed from source embeddings) for the proposed ChebySD-AAM and RiemannSD-AAM methods. Without it, the disentanglement claim rests on geometric intuition rather than on empirical evidence that the source encoders no longer retain speaker information. Additionally, the citations to Wang et al. (2026) and Fang et al. (2026) appear to be erroneous or to refer to unpublished work, which weakens the support for the claims about the Chebyshev and Riemannian approaches. Finally, the pseudo-speaker labels, obtained by thresholding cosine similarity at 0.5, are not validated against ground-truth speaker identities.

“High cross-task performance between Task1 (speaker verification) and Task2 (source verification) reveals strong entanglement... motivating speaker disentanglement”
paper · Table 1
“we leverage pseudo-speaker labels via cosine similarity between speaker embeddings... selected as the threshold to obtain binary (same/different speaker) trial keys”
paper · Section 4.2
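The pseudo-speaker labeling step can be sketched as follows, assuming each trial is a pair of speaker embeddings and that a cosine similarity at or above the threshold is read as "same speaker". The 0.5 threshold is the value cited in this review; the function names and the choice of `>=` over `>` are assumptions, and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pseudo_speaker_key(emb_a, emb_b, threshold=0.5):
    """Return 1 for a 'same speaker' trial, 0 for 'different speaker'."""
    return int(cosine_similarity(emb_a, emb_b) >= threshold)


# Toy speaker embeddings (hypothetical values).
print(pseudo_speaker_key([1.0, 0.0], [1.0, 0.1]))  # nearly parallel vectors
print(pseudo_speaker_key([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Validating such keys would mean comparing them with known speaker identities on a labeled subset, which is the check this review finds missing.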
Evidence and comparison

The evidence supports the internal comparisons: ChebySD-AAM and RiemannSD-AAM outperform standard AAM-Softmax, and the Riemannian variant gives the best results (EER 4.08% vs. 7.24% for the baseline on P-III with ResNet34). However, comparisons to alternative disentanglement techniques, such as adversarial training or the gradient reversal layers mentioned in Dao et al. (2026), are absent. The paper positions itself as the first work on speaker-disentangled source verification, but it does not test whether simpler baselines (e.g., multi-task learning with gradient reversal) would achieve similar gains. This leaves open whether the added mathematical complexity provides practical advantages over standard domain adaptation techniques.

“removal of speaker information results in a substantial performance degradation... we focus on designing a speaker-disentangling framework by integrating metric learning with two novel loss functions”
paper · Section 2.2
“Baseline [deng2019arcface] (×): 7.24 (±0.22) vs RiemannSD-AAM (✓): 4.08 (±0.14)”
paper · Table 4
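Since the comparison above is stated entirely in terms of EER, a minimal sketch of how an equal error rate can be approximated from trial scores may help. It sweeps every observed score as a threshold and returns the operating point where the false acceptance and false rejection rates are closest. This is a generic approximation, not the paper's evaluation script, and the function name and score lists are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Same-source trials should score high, different-source trials low.
same_source = [0.9, 0.8, 0.75, 0.4]
diff_source = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {100 * equal_error_rate(same_source, diff_source):.1f}%")
```

Production toolkits typically interpolate along the ROC curve rather than picking the nearest observed threshold, so values from this sketch can differ slightly on small trial lists.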
Reproducibility

Reproducibility is moderately strong: the authors promise open-source code, evaluation protocols, and a demo website (though the GitHub URL appears malformed as "111https://github.com/xxuan-acoustics/RiemannSD-Net"). Implementation details specify PyTorch Lightning, SpeechBrain, ReDimNet-B6 for speaker embeddings, and hyperparameters including $s=30$, $m=0.3$, $\tau=0.1$, $\gamma=2$, and curvature $c=6$. The MLAAD v8 dataset is publicly available. The sensitivity analysis (Table 4) shows that performance remains relatively stable across wide parameter ranges ($K=5$ vs $K=20$, $\lambda=0.1$ vs $\lambda=10$), suggesting the method is not overly sensitive to exact tuning. However, the ablation lacks statistical significance testing for the reported differences.

“The code, evaluation protocols and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net”
paper · Abstract
“We set $\tau=0.1$ (ChebySD-AAM) and $\gamma=2$ (RiemannSD-AAM)... batch size is 200”
paper · Section 4.3
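To make the roles of $s$ and $m$ concrete, here is a minimal sketch of the standard AAM-Softmax (ArcFace-style) baseline loss for a single example, using the reported $s=30$ and $m=0.3$ as defaults. It covers only the baseline, not the proposed ChebySD-AAM or RiemannSD-AAM losses, and the function name and cosine values are hypothetical.

```python
import math


def aam_softmax_loss(cosines, target, s=30.0, m=0.3):
    """Cross-entropy over scaled cosine logits with an additive angular margin.

    cosines[j] is the cosine between the embedding and the class-j weight.
    The margin m is added to the target-class angle only. Real
    implementations also handle the case theta + m > pi, omitted here.
    """
    logits = []
    for j, cos_theta in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, cos_theta)))
            cos_theta = math.cos(theta + m)
        logits.append(s * cos_theta)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2]  # toy similarities to three source classes
print(aam_softmax_loss(cosines, target=0))          # with margin
print(aam_softmax_loss(cosines, target=0, m=0.0))   # plain scaled softmax
```

The margin lowers the target logit, so the loss with $m=0.3$ is larger than the plain scaled softmax loss for the same input, which is what pushes classes further apart in angle during training.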
Abstract

Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomials to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.
