Feed - arxlens

0

Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

eess.AS cs.CL cs.SD Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · Mar 23, 2026

This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.

Speech deepfake source verification systems aims to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomial to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.

Read abstractHide abstract

0

SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing

eess.AS cs.CL cs.SD Jianyi Chen, Rongxiu Zhong, Shilei Zhang et al. · Mar 22, 2026

This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models, then restore it to normal speed, theoretically enabling models to produce 10+ minute compositions with fixed memory budgets. The approach is tested on continuation, completion, and singing accompaniment tasks.

Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.

Read abstractHide abstract

0

HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit

cs.SD cs.LG eess.AS Khushiyant, Param Thakkar · Mar 22, 2026

This paper studies the coupling between three design axes in audio representation learning: input frontend (raw waveform vs. spectrogram), backbone architecture (Mamba vs. attention), and sequence length. The authors introduce HELIX, a minimal hybrid architecture with five bidirectional Mamba layers and one attention bottleneck at matched 8.3M parameter capacity. The key finding is that these choices are not independent: raw waveforms help with Mamba but not attention, attention hurts on short environmental sounds but becomes critical at 30,000 tokens (5 minutes), where pure attention fails with OOM errors and HELIX closes an 11.5-point gap over pure Mamba on speaker identification.

Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX closes an 11.5-point gap over pure Mamba.

Read abstractHide abstract

0

TiCo: Time-Controllable Training for Spoken Dialogue Models

cs.CL cs.AI eess.AS Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al. · Mar 23, 2026

TiCo tackles a critical gap in spoken dialogue models: the inability to control response duration, which is essential for time-constrained scenarios like driving assistants or emergency healthcare. Unlike text length control, speech duration depends on complex factors including phonetics, prosody, and speaking rate. The paper proposes Spoken Time Markers (STMs)—special tokens like <15.0 seconds> inserted during generation—to enable real-time temporal awareness. Using a two-stage post-training framework (self-generated supervised fine-tuning followed by reinforcement learning with verifiable rewards), TiCo equips models to estimate elapsed time and adjust content dynamically to meet target durations.

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.

Read abstractHide abstract

0

TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

cs.CL cs.LG eess.AS Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou et al. · Mar 23, 2026

This paper introduces TaigiSpeech, the first intent recognition dataset for Taiwanese Hokkien—a low-resource language spoken by 65% of Taiwanese elders. With 3,000+ utterances from 21 elderly speakers across emergency and smart-home scenarios, it addresses a critical gap in speech technology for aging populations. The authors also propose keyword-based and audio-visual mining strategies to bootstrap training data from unlabeled video sources.

Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce \textbf{TaigiSpeech}, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found on https://kwchang.org/taigispeech.

Read abstractHide abstract

0

DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers

eess.AS cs.AI cs.SD Tianyu Cao, Helin Wang, Ari Frummer et al. · Mar 23, 2026

DiT-Flow tackles multi-condition speech enhancement (noise, reverberation, codec compression) by combining flow matching with a latent Diffusion Transformer (DiT) backbone. The paper proposes operating flow matching in a VAE-compressed latent space for efficiency, introduces StillSonicSet (a synthetic dataset with realistic room acoustics for stationary sources), and applies Mixture-of-LoRA-Experts (MoELoRA) for parameter-efficient adaptation to unseen distortions. The work matters because most SE models fail when deployed on real-world audio with compound distortions unseen during training.

Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact variational auto-encoders (VAEs)-derived latent features. We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve both parameter-efficient and high-performance training for DiT-Flow robust to multiple distortions with using 4.9% percentage of the total parameters to obtain a better performance on five unseen distortions.

Read abstractHide abstract

Nothing here yet