Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
9 papers in cs.SD
Trending mixes fresh papers with community signal.
0
cs.CLcs.SD Abner Hernandez, Eunjung Yeo, Kwanghee Choi et al. · Mar 23, 2026

Cross-lingual dysarthria detection in Parkinson's disease is hampered by language-dependent structure in self-supervised speech representations that confounds pathology classification. This paper proposes a centroid-based 'language shift' (LS) that aligns source-language embeddings toward target-language distributions using only healthy control speech, enabling zero-shot transfer without model retraining. The approach addresses the critical data scarcity in clinical speech applications while aiming to disentangle linguistic variation from motor impairment markers.

The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson's disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
0
eess.AScs.CLcs.SD Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · Mar 23, 2026

This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.

Speech deepfake source verification systems aims to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomial to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.
0
cs.SDcs.LG Risa Shinoda, Kaede Shiohara, Nakamasa Inoue et al. · Mar 23, 2026

AnimalCLAP addresses zero-shot species recognition from vocalizations—a critical challenge for biodiversity monitoring when training data is scarce for rare species. The core idea is to inject hierarchical taxonomic knowledge (class, order, family, genus, species) into audio-text contrastive learning via multiple prompt templates, paired with a large dataset of 4,225 hours covering 6,823 species annotated with 22 ecological traits. This matters because it enables automated monitoring in visually occluded habitats like dense forests while inferring biological traits directly from sound.

Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
0
eess.AScs.CLcs.SD Jianyi Chen, Rongxiu Zhong, Shilei Zhang et al. · Mar 22, 2026

This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models, then restore it to normal speed, theoretically enabling models to produce 10+ minute compositions with fixed memory budgets. The approach is tested on continuation, completion, and singing accompaniment tasks.

Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
0
cs.SDcs.LGeess.AS Khushiyant, Param Thakkar · Mar 22, 2026

This paper studies the coupling between three design axes in audio representation learning: input frontend (raw waveform vs. spectrogram), backbone architecture (Mamba vs. attention), and sequence length. The authors introduce HELIX, a minimal hybrid architecture with five bidirectional Mamba layers and one attention bottleneck at matched 8.3M parameter capacity. The key finding is that these choices are not independent: raw waveforms help with Mamba but not attention, attention hurts on short environmental sounds but becomes critical at 30,000 tokens (5 minutes), where pure attention fails with OOM errors and HELIX closes an 11.5-point gap over pure Mamba on speaker identification.

Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX closes an 11.5-point gap over pure Mamba.
0
cs.SDcs.LG Kazuki Matsumoto, Ren Uchida, Kohei Yatabe · Mar 23, 2026

Existing Lipschitz-constrained DNNs don't directly apply to audio amplitude modifiers (AMs) because the complex-valued reconstruction breaks continuity. This paper proves that AMs are generally not Lipschitz continuous, derives sufficient conditions for Lipschitz continuity (Assumption 3), and proposes LipsAM architectures that enforce these bounds via element-wise minimum and ReLU operations. The work matters because it enables certified robust amplitude modification and stabilizes Plug-and-Play algorithms where conventional AMs diverge.

The robustness of deep neural networks (DNNs) can be certified through their Lipschitz continuity, which has made the construction of Lipschitz-continuous DNNs an active research field. However, DNNs for audio processing have not been a major focus due to their poor compatibility with existing results. In this paper, we consider the amplitude modifier (AM), a popular architecture for handling audio signals, and propose its Lipschitz-continuous variants, which we refer to as LipsAM. We prove a sufficient condition for an AM to be Lipschitz continuous and propose two architectures as examples of LipsAM. The proposed architectures were applied to a Plug-and-Play algorithm for speech dereverberation, and their improved stability is demonstrated through numerical experiments.
0
eess.AScs.AIcs.SD Tianyu Cao, Helin Wang, Ari Frummer et al. · Mar 23, 2026

DiT-Flow tackles multi-condition speech enhancement (noise, reverberation, codec compression) by combining flow matching with a latent Diffusion Transformer (DiT) backbone. The paper proposes operating flow matching in a VAE-compressed latent space for efficiency, introduces StillSonicSet (a synthetic dataset with realistic room acoustics for stationary sources), and applies Mixture-of-LoRA-Experts (MoELoRA) for parameter-efficient adaptation to unseen distortions. The work matters because most SE models fail when deployed on real-world audio with compound distortions unseen during training.

Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact variational auto-encoders (VAEs)-derived latent features. We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve both parameter-efficient and high-performance training for DiT-Flow robust to multiple distortions with using 4.9% percentage of the total parameters to obtain a better performance on five unseen distortions.
0
cs.CLcs.AIcs.SD Tianle Yang, Chengzhe Sun, Phil Rose et al. · Mar 22, 2026

This paper investigates whether neural text-to-speech systems capture consonant-induced F0 perturbation—fine-grained phonetic effects where voiceless obstruents raise and voiced obstruents lower fundamental frequency relative to sonorants. The authors propose a segmental-level prosodic probing framework comparing Tacotron 2 and FastSpeech 2 against natural speech, stratifying by lexical frequency to test memorization versus abstraction. This matters because TTS evaluation often misses sub-phonemic articulatory detail that distinguishes human-like phonetic competence from surface pattern matching.

This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
0
cs.LGcs.AIcs.SD Soudeep Ghoshal, Sandipan Chakraborty, Pradipto Chowdhury et al. · Mar 22, 2026

This paper tackles the tension between local melodic continuity and global structural coherence in symbolic music generation. It proposes a hybrid architecture fusing a Transformer encoder (for global patterns) with an LSTM decoder (for temporal precision), evaluating it against pure LSTM and Transformer baselines using 17 musical quality metrics on 1,000 generated melodies per model. The work matters because it provides systematic evidence that architectural hybridization can reconcile the complementary strengths of memory-based and attention-based models.

Machine learning techniques, such as Transformers and Long Short-Term Memory (LSTM) networks, play a crucial role in Symbolic Music Generation (SMG). Existing literature indicates a difference between LSTMs and Transformers regarding their ability to model local melodic continuity versus maintaining global structural coherence. However, their specific properties within the context of SMG have not been systematically studied. This paper addresses this gap by providing a fine-grained comparative analysis of LSTMs versus Transformers for SMG, examining local and global properties in detail using 17 musical quality metrics on the Deutschl dataset. We find that LSTM networks excel at capturing local patterns but fail to preserve long-range dependencies, while Transformers model global structure effectively but tend to produce irregular phrasing. Based on this analysis and leveraging their respective strengths, we propose a Hybrid architecture combining a Transformer Encoder with an LSTM Decoder and evaluate it against both baselines. We evaluated 1,000 generated melodies from each of the three architectures on the Deutschl dataset. The results show that the hybrid method achieves better local and global continuity and coherence compared to the baselines. Our work highlights the key characteristics of these models and demonstrates how their properties can be leveraged to design superior models. We also supported the experiments with ablation studies and human perceptual evaluations, which statistically support the findings and provide robust validation for this work.