DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers

eess.AS cs.AI cs.SD Tianyu Cao, Helin Wang, Ari Frummer, Yuval Sieradzki, Adi Arbel, Laureano Moro Velazquez, Jesus Villalba, Oren Gal, Thomas Thebaud, Najim Dehak · Mar 23, 2026
AI Review
Plain-language introduction

DiT-Flow tackles multi-condition speech enhancement (SE) under noise, reverberation, and codec compression by combining flow matching with a latent Diffusion Transformer (DiT) backbone. The paper proposes operating flow matching in a VAE-compressed latent space for efficiency, introduces StillSonicSet (a synthetic dataset with realistic room acoustics for stationary sources), and applies Mixture-of-LoRA-Experts (MoELoRA) for parameter-efficient adaptation to unseen distortions. The work matters because SE models trained on limited data often degrade when deployed on real-world audio whose compound distortions were not seen during training.

Critical review
Verdict
Bottom line

DiT-Flow is a solid contribution to generative speech enhancement. It demonstrates that flow matching can outperform diffusion-based baselines (SGMSE, StoRM) on multi-distortion benchmarks while running at less than half their real-time factor (0.230 vs. roughly 0.5). The MoELoRA adaptation is notably parameter-efficient, training only 4.9% of the parameters with a minimal performance drop versus full fine-tuning. However, the claim of being the first to apply LoRA+MoE to generative SE is questionable given recent concurrent work, and the dataset contribution (StillSonicSet) is incremental: it essentially reprocesses existing SonicSet RIRs for stationary rather than moving sources.

“Two diffusion-based models, SGMSE and StoRM, show relatively high real-time factors, which are more than twice of RTF for DiT-Flow.”
paper · Table VI
“MoELoRA(MLP+Attn) achieves 3.442 SIG, 3.991 BAK, 3.144 OVRL with only 4.9% parameters versus 100% for full finetune.”
paper · Table VII
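To make the real-time factor comparison concrete, the sketch below assumes the usual definition of RTF as processing time divided by audio duration. The review does not state which definition the paper uses, the function names are illustrative, and the baseline value of 0.5 is the review's approximate figure rather than a number read from the paper's table.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent enhancing / duration of the audio (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(rtf, audio_seconds):
    """Invert the definition: seconds of compute needed for a clip."""
    return rtf * audio_seconds


# Values quoted in the review; 0.5 is only an approximation for the baselines.
dit_flow_rtf = 0.230
baseline_rtf = 0.5

clip_seconds = 10.0  # hypothetical clip length, chosen for illustration
dit_flow_time = processing_time(dit_flow_rtf, clip_seconds)
baseline_time = processing_time(baseline_rtf, clip_seconds)

# Ratio of the two RTFs; values below 1.0 for an RTF mean faster than real time.
speedup = baseline_rtf / dit_flow_rtf
```

Under these assumptions a 10-second clip needs about 2.3 seconds of compute with DiT-Flow and about 5 seconds with the diffusion baselines, a ratio of roughly 2.2, which is consistent with the quoted "more than twice".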
What holds up

The core technical approach is sound: flow matching in latent space enables deterministic ODE-based inference that is faster than SDE-based diffusion methods. The evidence supports this—DiT-Flow achieves the best LSD (4.506) and competitive DNSMOS scores under the challenging Reverb+Noise+Codec-Compression condition (Table II). The cross-dataset generalization experiments (Table VI) provide strong evidence that StillSonicSet training transfers better to real-recorded data (LibriCSS, RealMAN) than WSJ0+Reverb, validating the acoustic realism of the proposed dataset. The MoELoRA results (Table VII) are compelling: with only 4.9% trainable parameters, it matches or exceeds full fine-tuning on perceptual metrics (SIG, BAK, OVRL).

“DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement.”
paper · Table II
“This isolates domain-specific learning to a small parameter set, preserving prior performance while making the update lightweight, efficient, and practical for rapid deployment.”
paper · Section IV-D
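The deterministic ODE-based inference discussed in this section can be illustrated with a minimal sketch. It assumes a linear interpolation path between a noise sample and the clean latent, plus a fixed-step Euler solver. The review does not specify the paper's probability path, conditioning, or solver, so those choices, the function names, and the toy three-dimensional "latent" are all illustrative. An oracle velocity stands in for the trained DiT network, and the 50 steps match the inference setting quoted from the paper.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Linear-path flow matching (assumed): point on the path and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
clean_latent = [0.5, -1.0, 2.0]                 # toy stand-in for a VAE latent
noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]

# Oracle velocity: in a real system a network would predict this from (x, t)
# and the degraded-speech conditioning; here it is the exact regression target.
_, oracle_v = cfm_training_pair(noise, clean_latent, t=0.5)


def velocity(x, t):
    return oracle_v


enhanced = euler_sample(velocity, noise, num_steps=50)
enhanced_again = euler_sample(velocity, noise, num_steps=50)
```

Because no noise is injected during integration, the same starting point always yields the same output, and each sample costs exactly `num_steps` velocity evaluations. That fixed, modest evaluation budget is the property the review credits for the speed advantage over SDE-based samplers.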
Main concerns

Several issues undermine the paper's strongest claims. First, the assertion that this is the first work applying LoRA with MoE to generative SE is poorly justified: recent papers such as FlowSE (Wang et al., 2025, cited as [19]) already apply flow matching to SE, so the novelty rests on the adaptation scheme alone, and the MoELoRA technique itself is borrowed from NLP without significant architectural innovation. Second, the dataset contribution is thin: StillSonicSet uses the exact same 90 Matterport3D scenes and RIRs as SonicSet, merely fixing source positions rather than moving them. Third, the model does not consistently dominate baselines: under the Reverb-only condition (Table III), SGMSE achieves higher PESQ (2.011 vs 1.599) and ESTOI (0.632 vs 0.578). Finally, the lack of ablation studies on critical hyperparameters (number of experts, LoRA rank, ODE solver steps) makes it unclear which design choices matter most.

“SGMSE performs best under the Reverb-only condition, achieving a PESQ score of 2.011, but declines to 1.35 in the Reverb + Noise + Compression condition.”
paper · Table III
“We first apply LoRA with the MoE framework to a generative speech enhancement system to adapt multiple distortions, achieving both parameter-efficient and high-performance training.”
paper · Section I
Evidence and comparison

The comparison methodology is mostly fair: the baselines (SGMSE, StoRM) were retrained on the same StillSonicSet data rather than evaluated from pretrained checkpoints. The evaluation relies heavily on DNSMOS P.835 for perceptual quality, a choice the paper justifies by noting that it is better suited to generative models than alignment-sensitive metrics such as PESQ and ESTOI. The evidence for MoELoRA's superiority over standard LoRA is strong (Table VII), showing clear gains across all perceptual metrics when adapting to five unseen distortions (clipping, bandwidth limitation, codec loss, packet loss, wind noise). The paper could be clearer about why full fine-tuning underperforms MoELoRA on some metrics. This counterintuitive result suggests possible overfitting or optimization instability in full fine-tuning and deserves investigation.

“We trained those models on the same datasets mentioned in V-A using the authors' official settings.”
paper · Section V-C
“Full finetuning offers the best scores of PESQ and LSD but at high computational cost. In contrast, MoELoRA achieves a favorable balance between performance and efficiency.”
paper · Section VI-C
Reproducibility

Reproducibility is moderately good but has gaps. The paper provides detailed model configurations: the VAE uses TF-GridNet with 3 blocks and 128-dimensional latents; the DiT backbone has 12 layers, an embedding dimension of 384, and 6 heads; flow matching uses 50 ODE solver steps; MoELoRA uses rank $r=8$ with scaling factor $\alpha=16$ and the top $k=3$ experts out of $N=5$. The audio compressor architecture (STFT-based, 40 ms window, 20 ms hop) is specified clearly. However, no code repository is linked in the paper, and generating StillSonicSet relies on SonicSim, which requires access to Matterport3D, limiting reproducibility for researchers without those resources. The URGENT challenge dataset used for the adaptation experiments is not yet publicly available (cited as ICASSP 2026), which limits independent verification of the MoELoRA results.

“The audio compressor consisted of a total of 49.3 million parameters. The target extractor consists of approximately 50.6 million parameters. During inference, the number of ODE solver steps was set to 50.”
paper · Section V-B
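The STFT framing quoted for the audio compressor (40 ms window, 20 ms hop) translates into sample counts once a sample rate is fixed. The sketch below assumes 16 kHz audio and no padding at the clip edges. Neither detail is stated in the review, and the function names are illustrative.

```python
def stft_frame_params(sample_rate_hz, window_ms=40.0, hop_ms=20.0):
    """Convert window and hop durations in milliseconds to sample counts."""
    window = round(sample_rate_hz * window_ms / 1000.0)
    hop = round(sample_rate_hz * hop_ms / 1000.0)
    return window, hop


def num_frames(num_samples, window, hop):
    """Count frames without padding: only windows that fit entirely in the signal."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


SAMPLE_RATE_HZ = 16000  # assumption: not stated in the review
window, hop = stft_frame_params(SAMPLE_RATE_HZ)
frames_per_second_clip = num_frames(SAMPLE_RATE_HZ, window, hop)
overlap_fraction = 1.0 - hop / window
```

With these assumptions the window is 640 samples, the hop is 320 samples, consecutive windows overlap by 50%, and the latent sequence advances at roughly 50 frames per second of audio. Anyone reimplementing the compressor would need to confirm the actual sample rate and padding convention with the authors.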
“Each block is augmented with a set of LoRA-based experts, implemented at the low rank $r=8$, together with a learned gating router. Model capacity and specialization are controlled by the experts with a number of 5.”
paper · Section V-B3
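The MoELoRA configuration reported above ($r=8$, $\alpha=16$, top 3 of 5 experts) can be sketched for a single linear layer as follows. This is a generic mixture-of-LoRA-experts layer, not the authors' implementation: the router design, the renormalization of gate weights over the selected experts, the zero initialization of the `B` matrices, and the toy width of 16 (instead of the reported 384) are all assumptions, and every class and function name is hypothetical.

```python
import math
import random

R, ALPHA, NUM_EXPERTS, TOP_K = 8, 16, 5, 3  # values quoted in the review


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def top_k_gates(logits, k):
    """Softmax over the k largest logits; unselected experts get no weight."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


class MoELoRALinear:
    """Frozen base weight plus a gated sum of low-rank expert updates."""

    def __init__(self, d_in, d_out, rng):
        self.w = rand_matrix(d_out, d_in, rng)              # frozen base weight
        self.router = rand_matrix(NUM_EXPERTS, d_in, rng)   # trainable gating
        self.a = [rand_matrix(R, d_in, rng) for _ in range(NUM_EXPERTS)]
        self.b = [zeros(d_out, R) for _ in range(NUM_EXPERTS)]  # zero init

    def forward(self, x):
        y = matvec(self.w, x)
        gates = top_k_gates(matvec(self.router, x), TOP_K)
        scale = ALPHA / R
        for i, gate in gates.items():
            delta = matvec(self.b[i], matvec(self.a[i], x))  # B_i (A_i x)
            y = [yi + gate * scale * di for yi, di in zip(y, delta)]
        return y


rng = random.Random(0)
layer = MoELoRALinear(d_in=16, d_out=16, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(16)]
gates = top_k_gates(matvec(layer.router, x), TOP_K)
y = layer.forward(x)
```

With the `B` matrices initialized to zero, the adapted layer reproduces the frozen layer exactly before any adaptation step, which is the standard LoRA property that lets training start from the pretrained behavior. Only the router and the expert matrices would be updated, which is how a small trainable fraction such as the reported 4.9% becomes possible.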
Abstract

Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact variational auto-encoders (VAEs)-derived latent features. We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve both parameter-efficient and high-performance training for DiT-Flow robust to multiple distortions with using 4.9% percentage of the total parameters to obtain a better performance on five unseen distortions.
