DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers
DiT-Flow tackles multi-condition speech enhancement (noise, reverberation, codec compression) by combining flow matching with a latent Diffusion Transformer (DiT) backbone. The paper proposes operating flow matching in a VAE-compressed latent space for efficiency, introduces StillSonicSet (a synthetic dataset with realistic room acoustics for stationary sources), and applies Mixture-of-LoRA-Experts (MoELoRA) for parameter-efficient adaptation to unseen distortions. The work matters because most SE models fail when deployed on real-world audio with compound distortions unseen during training.
DiT-Flow is a solid contribution to generative speech enhancement that demonstrates flow matching can outperform diffusion-based baselines (SGMSE, StoRM) on multi-distortion benchmarks while roughly halving the real-time factor (0.230 vs ~0.5; lower is better). The MoELoRA adaptation shows impressive parameter efficiency (4.9% of parameters) with minimal performance drop versus full fine-tuning. However, the claim of being the first to apply LoRA+MoE to generative SE is questionable given recent concurrent work, and the dataset contribution (StillSonicSet) is incremental: it essentially reprocesses existing SonicSet RIRs for stationary rather than moving sources.
The core technical approach is sound: flow matching in latent space enables deterministic ODE-based inference that is faster than SDE-based diffusion methods. The evidence supports this—DiT-Flow achieves the best LSD (4.506) and competitive DNSMOS scores under the challenging Reverb+Noise+Codec-Compression condition (Table II). The cross-dataset generalization experiments (Table VI) provide strong evidence that StillSonicSet training transfers better to real-recorded data (LibriCSS, RealMAN) than WSJ0+Reverb, validating the acoustic realism of the proposed dataset. The MoELoRA results (Table VII) are compelling: with only 4.9% trainable parameters, it matches or exceeds full fine-tuning on perceptual metrics (SIG, BAK, OVRL).
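To make the deterministic ODE-based inference concrete, the following is a minimal sketch of fixed-step Euler integration of a velocity field from t=0 to t=1. It is not the paper's implementation: the function names, the choice of Euler as the solver, the zero starting latent, and the toy velocity field (a hypothetical stand-in for the trained DiT) are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int = 50) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for the trained network.

    Returns the exact velocity of the straight-line path that reaches
    `target` at t=1, so the solver behaviour can be checked in isolation.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Assumed toy data: a 3-dimensional "clean latent" and an all-zero start point.
clean_latent = [0.5, -1.0, 2.0]
start_latent = [0.0, 0.0, 0.0]
enhanced = euler_sample(make_toy_velocity(clean_latent), start_latent, num_steps=50)
```

Because the loop contains no random draws, the same input always yields the same output, and the cost is exactly `num_steps` evaluations of the velocity function. In a real system each evaluation would be one forward pass of the network in latent space.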
Several issues undermine the paper's strongest claims. First, the assertion that this is the first work applying LoRA with MoE to generative SE is poorly justified: recent papers like FlowSE (Wang et al., 2025, cited as [19]) also apply flow matching to SE, so the novelty rests on the adaptation scheme, and the MoELoRA technique itself is borrowed from NLP without significant architectural innovation. Second, the dataset contribution is thin: StillSonicSet uses the exact same 90 Matterport3D scenes and RIRs as SonicSet, merely fixing source positions rather than moving them. Third, the model does not consistently dominate baselines: on Reverb-only conditions (Table III), SGMSE achieves higher PESQ (2.011 vs 1.599) and ESTOI (0.632 vs 0.578). Finally, the lack of ablation studies on critical hyperparameters (number of experts, LoRA rank, ODE solver steps) makes it unclear which design choices matter most.
The comparison methodology is mostly fair: baselines (SGMSE, StoRM) were retrained on the same StillSonicSet data rather than using pretrained checkpoints. The evaluation does rely heavily on DNSMOS P.835 for perceptual quality, although the paper correctly notes that it is better suited to generative models than alignment-sensitive metrics like PESQ and ESTOI. The evidence for MoELoRA's superiority over standard LoRA is strong (Table VII), showing clear gains across all perceptual metrics when adapting to five unseen distortions (clipping, bandwidth limitation, codec loss, packet loss, wind noise). The paper could be clearer about why full fine-tuning underperforms MoELoRA on some metrics. This counterintuitive result suggests possible overfitting or optimization instability in full fine-tuning that deserves investigation.
Reproducibility is moderately good but has gaps. The paper provides detailed model configurations: VAE uses TF-GridNet with 3 blocks and 128-dimensional latents; the DiT backbone has 12 layers, 384 embedding dim, and 6 heads; flow matching uses 50 ODE solver steps; MoELoRA uses rank $r=8$ with scaling factor $\alpha=16$ and $k=3$ top experts from $N=5$ total. The audio compressor architecture (STFT-based, 40ms window, 20ms hop) is specified clearly. However, no code repository is linked in the paper, and the StillSonicSet dataset generation relies on SonicSim which requires access to Matterport3D—limiting reproducibility for researchers without those resources. The URGENT challenge dataset used for adaptation experiments is not yet publicly available (cited as ICASSP 2026), which limits independent verification of the MoELoRA results.
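As a rough illustration of the MoELoRA configuration quoted above (rank r=8, scaling α=16, top k=3 of N=5 experts), here is a minimal sketch of one adapted linear layer. It is not the paper's code. The helper names, the softmax-over-selected-experts gating, the zero initialization of the up-projection, and the toy layer width of 4 are all assumptions. Only r, α, k, and N come from the review.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_k_gates(logits: Vector, k: int) -> Dict[int, float]:
    """Softmax over the k largest router logits; other experts are skipped."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moelora_forward(x: Vector, base_w: Matrix, router_w: Matrix,
                    experts: List[Tuple[Matrix, Matrix]],
                    k: int = 3, rank: int = 8, alpha: float = 16.0
                    ) -> Tuple[Vector, Dict[int, float]]:
    """Frozen base output plus a gated sum of low-rank expert updates."""
    y = matvec(base_w, x)
    gates = top_k_gates(matvec(router_w, x), k)
    scale = alpha / rank  # standard LoRA scaling factor
    for i, gate in gates.items():
        down, up = experts[i]  # shapes: (rank x d_in), (d_out x rank)
        delta = matvec(up, matvec(down, x))
        y = [yi + gate * scale * di for yi, di in zip(y, delta)]
    return y, gates


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, num_experts, rank = 4, 5, 8  # d_model=4 is a toy width, not the paper's 384
base_w = rand_matrix(d_model, d_model)
router_w = rand_matrix(num_experts, d_model)
# Up-projections start at zero (assumed, as in standard LoRA), so the adapted
# layer initially reproduces the frozen base layer exactly.
experts = [(rand_matrix(rank, d_model), [[0.0] * rank for _ in range(d_model)])
           for _ in range(num_experts)]

x = [1.0, -0.5, 0.25, 2.0]
y, gates = moelora_forward(x, base_w, router_w, experts, k=3, rank=rank, alpha=16.0)
```

In this sketch only the router and the expert matrices would be trainable, which is how a small trainable fraction arises. The reported 4.9% figure cannot be reproduced from it, since it depends on which layers the paper adapts.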
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from variational autoencoders (VAEs). We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve parameter-efficient, high-performance training that makes DiT-Flow robust to multiple distortions, using 4.9% of the total parameters while obtaining better performance on five unseen distortions.