Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
This paper addresses a subtle but critical issue in latent diffusion models (LDMs): VAE tokenizers tend to collapse latent variance toward zero to minimize reconstruction error, creating overly compact manifolds that are brittle against sampling perturbations. The authors propose a Variance Expansion (VE) loss that adaptively counteracts this collapse via an inverse-variance term $\mathcal{L}_{\text{var}} = 1/(\sigma^2 + \delta)$, allowing the latent space to absorb stochastic diffusion noise while maintaining reconstruction fidelity. The work reports a state-of-the-art FID of 1.18 on ImageNet 256$\times$256 and provides both theoretical grounding and empirical validation across multiple architectures.
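To make the inverse-variance term concrete, here is a minimal NumPy sketch of $\mathcal{L}_{\text{var}} = 1/(\sigma^2 + \delta)$. The function name, the mean reduction over latent entries, and the default value of `delta` are illustrative assumptions rather than details confirmed by the paper or its repository.

```python
import numpy as np

def variance_expansion_loss(sigma, delta=1e-6):
    """Inverse-variance penalty 1 / (sigma^2 + delta), averaged over entries.

    `sigma` holds the per-entry latent standard deviations predicted by the
    encoder. The mean reduction and the default `delta` are assumptions.
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    return float(np.mean(1.0 / (sigma ** 2 + delta)))

# A nearly collapsed latent is penalized far more than a spread-out one.
collapsed = variance_expansion_loss(np.full(4, 1e-3))
healthy = variance_expansion_loss(np.full(4, 0.5))
```

The penalty grows without bound as $\sigma \to 0$ (up to the $\delta$ floor) and flattens for large $\sigma$, so under these assumptions it pushes hardest exactly where variance has collapsed.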
The paper presents a technically sound and practically impactful contribution. The theoretical analysis rigorously establishes why reconstruction loss induces variance collapse (via the decoder Jacobian trace term $T(\mu)$), and the proposed VE loss provides a principled mechanism to achieve adaptive, locally-varying variance. The empirical results are strong: the method consistently improves FID scores across vanilla LDM, VAVAE, and LightningDiT backbones, reaching 1.18 FID on ImageNet 256$\times$256 with CFG. The claim that KL regularization is unnecessary for LDMs is well-supported by Appendix A's argument that diffusion learns an implicit prior. However, the framing of 'adversarial interplay' is slightly misleading, as the losses are simply summed rather than adversarially trained, and the reconstruction degradation (PSNR drops from 27.71 to 26.31 in Table 2) is downplayed despite being a measurable trade-off.
The theoretical derivation of variance collapse via first-order Taylor expansion of the decoder is rigorous and explains the phenomenon better than prior heuristic observations. The toy example (Figure 1) effectively visualizes how compact latent manifolds lead to sampling failures. The equilibrium analysis showing $\sigma = (\lambda/T(\mu))^{1/4}$ (Eq. 14) demonstrates that the method adapts variance to local decoder sensitivity, which is a significant advantage over fixed KL regularization. The comprehensive ablation in Table 2 demonstrating that sweeping KL weights from $10^{-6}$ to $8$ cannot achieve VE's performance (FID 18.90 vs best KL FID 22.87) is convincing evidence that the method is not merely a hyperparameter tuning of the KL term.
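The equilibrium $\sigma = (\lambda/T(\mu))^{1/4}$ can be sanity-checked numerically. The sketch below assumes the per-sample objective takes the form $\sigma^2 T(\mu) + \lambda/\sigma^2$, i.e. a first-order reconstruction penalty plus the VE term with $\delta \to 0$. That form is an assumption chosen because it reproduces the stated equilibrium; it is not copied from the paper, and the function names and numeric values are illustrative.

```python
import numpy as np

def total_objective(sigma, trace_term, lam):
    # Assumed form: reconstruction penalty sigma^2 * T plus VE term lam / sigma^2.
    return sigma ** 2 * trace_term + lam / sigma ** 2

def closed_form_sigma(trace_term, lam):
    # Setting d/d(sigma^2) [sigma^2 * T + lam / sigma^2] = 0 gives sigma^4 = lam / T.
    return (lam / trace_term) ** 0.25

def grid_argmin_sigma(trace_term, lam):
    # Brute-force minimizer over a dense grid, used only to check the closed form.
    grid = np.linspace(1e-3, 5.0, 500_000)
    return float(grid[np.argmin(total_objective(grid, trace_term, lam))])

# A more sensitive decoder (larger T) should settle at a smaller sigma.
sigma_flat = closed_form_sigma(10.0, 0.1)
sigma_steep = closed_form_sigma(1000.0, 0.1)
```

Under this assumed objective, the grid search and the closed form agree, and the equilibrium variance shrinks where the decoder is more sensitive, which is the locally adaptive behavior the review credits to the method.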
The method introduces three additional hyperparameters ($\lambda_1, \lambda_2, \tau$) with values (0.1, $10^{-6}$, 1) justified only as 'empirically set' without sensitivity analysis, which raises concerns about transferability to other datasets or resolutions. The paper claims VE loss maintains 'strong reconstruction fidelity,' but the evidence is mixed. In Table 4 their method achieves PSNR 28.31 against 27.71 for VA-VAE, which is slightly better rather than worse. In Table 2, however, VE achieves PSNR 26.31 against 26.16-27.12 for the KL-regularized baselines, indicating that the reconstruction-preserving claim is dataset/setting-dependent. The comparison with concurrent RAE work feels defensive; while they acknowledge RAE uses a '$\sigma$-VAE-like' approach, they understate that RAE achieved similar FID (1.41) without requiring tokenizer retraining. The 'adversarial interplay' description is hyperbolic: the losses are linearly combined, not adversarial.
The evidence strongly supports the core claim that VE loss improves diffusion sampling quality. Table 1 demonstrates consistent FID improvements across both vanilla LDM ($\Delta$FID -2.55 on DiT-B) and foundation-model-aligned VAVAE ($\Delta$FID -4.43 on LightningDiT-B). The comparison to prior work is generally fair: Table 3 includes comprehensive SOTA baselines (REPA, REG, RAE, MAR, VAR) and shows their method achieves the best FID (1.18) with fewer training epochs (530 vs 800 for most competitors). However, they omit direct comparison with the '$\sigma$-VAE' baseline from Sun et al. in the main experiments, only mentioning it in related work. The claim that 'tuning the KL regularization coefficient alone cannot fundamentally resolve the problem' is well-supported by Table 2, which shows FID degrades as KL weight increases beyond $10^{-2}$ even though latent variance increases.
Reproducibility is moderately good but has gaps. The authors provide a GitHub repository (github.com/CVL-UESTC/VE-Loss) and detailed training protocols in Section 5.1, including batch sizes (64 for tokenizer, 1024 for DiT), learning rates ($2.5\times 10^{-5}$ for tokenizer, $2\times 10^{-4}$ for DiT), and optimizer settings (AdamW with $\beta_1=0.5, \beta_2=0.9$). However, the critical switch to the Muon optimizer for long-horizon training of DINOv2-aligned spaces, a key detail for reproducing their best results, is mentioned only in Appendix B without full hyperparameters. The toy experiment details (Appendix C) are thorough, but the exact VAVAE fine-tuning protocol (only 5 epochs vs standard 50/130) lacks specifics on learning rate schedules or whether other hyperparameters were adjusted. The paper mentions 'limited computational resources' for some experiments but does not specify GPU-hours or training time, making cost-based reproducibility assessment difficult.
Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.