Cross-lingual dysarthria detection in Parkinson's disease is hampered by language-dependent structure in self-supervised speech representations that confounds pathology classification. This paper proposes a centroid-based 'language shift' (LS) that aligns source-language embeddings toward target-language distributions using only healthy control speech, enabling zero-shot transfer without model retraining. The approach addresses the critical data scarcity in clinical speech applications while aiming to disentangle linguistic variation from motor impairment markers.
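The centroid-based shift described above can be sketched as a single translation in embedding space. This is a minimal illustration, not the paper's implementation: it assumes the shift is the difference between the healthy-control centroids of the two languages, and the function name `language_shift`, the array shapes, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np

def language_shift(source_emb, source_hc_emb, target_hc_emb):
    """Translate source-language embeddings by the offset between
    healthy-control (HC) centroids. Assumed form of the shift."""
    offset = target_hc_emb.mean(axis=0) - source_hc_emb.mean(axis=0)
    return source_emb + offset

# Random stand-ins for utterance-level embeddings (rows = utterances).
rng = np.random.default_rng(0)
source_hc = rng.normal(loc=1.0, size=(50, 8))    # source-language healthy controls
target_hc = rng.normal(loc=-1.0, size=(40, 8))   # target-language healthy controls
source_all = rng.normal(loc=1.0, size=(20, 8))   # source-language speakers to classify

shifted_hc = language_shift(source_hc, source_hc, target_hc)
shifted_all = language_shift(source_all, source_hc, target_hc)
```

Because the offset is estimated from healthy controls only, no patient data from the target language is needed, and the downstream classifier is left untouched. A pure translation also preserves the differences between individual embeddings, so any within-language separation between speakers is unchanged by the shift.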
This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.
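To make the hyperbolic-space component concrete, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Which model the authors use, and how it interacts with the Chebyshev approximation, is not stated in this summary, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball
    (Poincare ball model with curvature -1)."""
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

origin = np.array([0.0, 0.0])
point = np.array([0.5, 0.0])
d = poincare_distance(origin, point)  # equals 2 * artanh(0.5) = ln(3)
```

Distances grow rapidly near the boundary of the ball, which is why hyperbolic embeddings are often used when the data are expected to have hierarchical structure.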
AnimalCLAP addresses zero-shot species recognition from vocalizations, a critical challenge for biodiversity monitoring when training data for rare species is scarce. The core idea is to inject hierarchical taxonomic knowledge (class, order, family, genus, species) into audio-text contrastive learning via multiple prompt templates, paired with a large dataset of 4,225 hours covering 6,823 species annotated with 22 ecological traits. This matters because it enables automated monitoring in visually occluded habitats such as dense forests while inferring biological traits directly from sound.
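One way to picture the taxonomic prompting is to generate one text prompt per rank for each recording. The sketch below is illustrative only: the template wording and the function name `build_prompts` are hypothetical, and the paper's actual templates are not reproduced here.

```python
TAXONOMIC_RANKS = ("class", "order", "family", "genus", "species")

def build_prompts(taxonomy):
    """Return one text prompt per taxonomic rank present in `taxonomy`.
    The template string is a made-up example, not the paper's."""
    return [
        f"the sound of an animal of the {rank} {taxonomy[rank]}"
        for rank in TAXONOMIC_RANKS
        if rank in taxonomy
    ]

# Great tit, used here purely as an example species.
great_tit = {
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Paridae",
    "genus": "Parus",
    "species": "Parus major",
}
prompts = build_prompts(great_tit)
```

Prompts at coarser ranks are shared by many species, so a text encoder trained on them can place an unseen species near its relatives, which is the intuition behind zero-shot transfer in this setting.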
This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models and then restore it to normal speed, which in principle lets models produce compositions of 10 minutes or more within a fixed memory budget. The approach is tested on continuation, completion, and singing accompaniment tasks.
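The memory argument is simple arithmetic: the generator only ever sees the sped-up audio, so its sequence length shrinks by the speed-up factor. The helper below is a back-of-the-envelope sketch with hypothetical names; it says nothing about how the speed-up or restoration is actually performed.

```python
def compressed_seconds(duration_s, speedup):
    """Length of the audio the generator sees after temporal speed-up."""
    return duration_s / speedup

def max_original_seconds(budget_s, speedup):
    """Longest normal-speed piece that fits a fixed generation budget."""
    return budget_s * speedup

ten_minutes = 600.0
lengths = {factor: compressed_seconds(ten_minutes, factor) for factor in (2, 4, 8)}
# lengths == {2: 300.0, 4: 150.0, 8: 75.0}
```

Read the other way, a model that can handle 75 seconds of audio would cover a 10-minute piece at 8× speed-up, provided the restoration step recovers acceptable quality.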
This paper studies the coupling between three design axes in audio representation learning: input frontend (raw waveform vs. spectrogram), backbone architecture (Mamba vs. attention), and sequence length. The authors introduce HELIX, a minimal hybrid architecture with five bidirectional Mamba layers and one attention bottleneck, compared against baselines at a matched capacity of 8.3M parameters. The key finding is that these choices are not independent. Raw waveforms help with Mamba but not with attention. Attention hurts on short environmental sounds but becomes critical at 30,000 tokens (5 minutes), where pure attention fails with OOM errors and HELIX closes an 11.5-point gap over pure Mamba on speaker identification.
Existing Lipschitz-constrained DNNs don't directly apply to audio amplitude modifiers (AMs) because the complex-valued reconstruction breaks continuity. This paper proves that AMs are generally not Lipschitz continuous, derives sufficient conditions for Lipschitz continuity (Assumption 3), and proposes LipsAM architectures that enforce these bounds via element-wise minimum and ReLU operations. The work matters because it enables certified robust amplitude modification and stabilizes Plug-and-Play algorithms where conventional AMs diverge.
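The discontinuity can be seen with a toy amplitude modifier that keeps the phase of a complex input and replaces its amplitude. The sketch below is an assumption-laden illustration, not the LipsAM architecture: `unconstrained` is a deliberately bad hypothetical modifier that always outputs amplitude 1, and `clamped` shows one way ReLU and an element-wise minimum could bound the output amplitude by the input amplitude. The paper's Assumption 3 is not reproduced here.

```python
import numpy as np

def apply_am(x, amp_fn):
    """Replace the amplitude of complex x with amp_fn(|x|), keeping the phase."""
    a = np.abs(x)
    safe_a = np.where(a > 0, a, 1.0)          # avoid division by zero
    phase = np.where(a > 0, x / safe_a, 0.0)  # unit-modulus phase factor
    return amp_fn(a) * phase

def unconstrained(a):
    # Hypothetical modifier: output amplitude 1 regardless of the input.
    return np.ones_like(a)

def clamped(a):
    # ReLU keeps the amplitude non-negative; the minimum caps it at the input amplitude.
    return np.minimum(np.maximum(unconstrained(a), 0.0), a)

def ratio(amp_fn, eps):
    """Output distance over input distance for the pair (+eps, -eps)."""
    x1 = np.array([eps + 0j])
    x2 = np.array([-eps + 0j])
    out_gap = np.abs(apply_am(x1, amp_fn) - apply_am(x2, amp_fn))[0]
    return out_gap / np.abs(x1 - x2)[0]

r_unconstrained = ratio(unconstrained, 1e-6)
r_clamped = ratio(clamped, 1e-6)
```

For the unconstrained modifier the two outputs sit at +1 and -1 however close the inputs are, so the ratio grows without bound as `eps` shrinks. With the clamp, the output amplitude goes to zero with the input and the ratio stays at 1 for this pair of points.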
DiT-Flow tackles multi-condition speech enhancement (SE) under noise, reverberation, and codec compression by combining flow matching with a latent Diffusion Transformer (DiT) backbone. The paper proposes operating flow matching in a VAE-compressed latent space for efficiency, introduces StillSonicSet (a synthetic dataset with realistic room acoustics for stationary sources), and applies Mixture-of-LoRA-Experts (MoELoRA) for parameter-efficient adaptation to unseen distortions. The work matters because most SE models fail when deployed on real-world audio with compound distortions unseen during training.
This paper investigates whether neural text-to-speech (TTS) systems capture consonant-induced F0 perturbation: fine-grained phonetic effects in which voiceless obstruents raise and voiced obstruents lower fundamental frequency relative to sonorants. The authors propose a segmental-level prosodic probing framework comparing Tacotron 2 and FastSpeech 2 against natural speech, stratifying by lexical frequency to test memorization versus abstraction. This matters because TTS evaluation often misses sub-phonemic articulatory detail that distinguishes human-like phonetic competence from surface pattern matching.
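The quantity being probed can be illustrated as a difference of mean vowel-onset F0 between consonant classes, with sonorants as the baseline. The sketch below is a simplification under stated assumptions: the function name `f0_perturbation` and the input layout are hypothetical, the values are made-up numbers for illustration rather than measurements, and the paper's actual measurement windows and normalization are not reproduced.

```python
from statistics import mean

def f0_perturbation(onset_f0):
    """Mean vowel-onset F0 (Hz) of each consonant class minus the sonorant baseline."""
    baseline = mean(onset_f0["sonorant"])
    return {
        cls: mean(values) - baseline
        for cls, values in onset_f0.items()
        if cls != "sonorant"
    }

# Made-up F0 values in Hz, chosen only to show the expected sign pattern.
toy = {
    "sonorant": [200.0, 202.0],
    "voiceless": [210.0, 212.0],
    "voiced": [195.0, 197.0],
}
effects = f0_perturbation(toy)
```

A positive value for the voiceless class and a negative value for the voiced class would match the direction described above. Running the same computation on synthesized and natural speech gives a direct comparison of how much of the effect a TTS system reproduces.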
This paper tackles the tension between local melodic continuity and global structural coherence in symbolic music generation. It proposes a hybrid architecture fusing a Transformer encoder (for global patterns) with an LSTM decoder (for temporal precision), evaluating it against pure LSTM and Transformer baselines using 17 musical quality metrics on 1,000 generated melodies per model. The work matters because it provides systematic evidence that architectural hybridization can reconcile the complementary strengths of memory-based and attention-based models.