Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models

cs.CV Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, Shuhang Gu · Mar 22, 2026

Why it matters

The work achieves state-of-the-art FID 1.18 on ImageNet 256$\times$256 and provides both theoretical grounding and empirical validation across multiple architectures.

AI Review AI reviewed

Plain-language introduction

This paper addresses a subtle but critical issue in latent diffusion models (LDMs): VAE tokenizers tend to collapse latent variance toward zero to minimize reconstruction error, creating overly compact manifolds that are brittle against sampling perturbations. The authors propose a Variance Expansion (VE) loss that adaptively counteracts this collapse via an inverse-variance term $\mathcal{L}_{\text{var}} = 1/(\sigma^2 + \delta)$, allowing the latent space to absorb stochastic diffusion noise while maintaining reconstruction fidelity. The work achieves state-of-the-art FID 1.18 on ImageNet 256$\times$256 and provides both theoretical grounding and empirical validation across multiple architectures.

Critical review

Verdict

Bottom line

The paper presents a technically sound and practically impactful contribution. The theoretical analysis rigorously establishes why reconstruction loss induces variance collapse (via the decoder Jacobian trace term $T(\mu)$), and the proposed VE loss provides a principled mechanism to achieve adaptive, locally-varying variance. The empirical results are strong: the method consistently improves FID scores across vanilla LDM, VAVAE, and LightningDiT backbones, reaching 1.18 FID on ImageNet 256$\times$256 with CFG. The claim that KL regularization is unnecessary for LDMs is well-supported by Appendix A's argument that diffusion learns an implicit prior. However, the framing of 'adversarial interplay' is slightly misleading, as the losses are simply summed rather than adversarially trained, and the reconstruction trade-off in Table 2 (PSNR 26.31 with VE loss versus up to 27.12 for the KL-regularized baselines) is downplayed despite being measurable.

“$\mathcal{L}(\mu,\sigma)\approx\|\mathbf{X}_{0}-\mathcal{D}(\mu)\|^{2}+\sigma^{2}T(\mu)$”

Section 4.1, Eq. 8
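
For intuition on Eq. 8: when the decoder is linear, the first-order expansion is exact, so the $\sigma^{2}T(\mu)$ penalty can be checked numerically. The sketch below is a toy check, not the authors' code. It assumes $T(\mu)$ is the squared Frobenius norm of the decoder Jacobian, $\operatorname{tr}(J^{\top}J)$, and the matrix `A`, target `X`, and mean `MU` are made-up values.

```python
import random

# Hypothetical linear "decoder" D(z) = A z, so its Jacobian is A everywhere.
A = [[2.0, 0.0], [1.0, -1.0], [0.5, 3.0]]
X = [1.0, 0.0, -1.0]   # made-up target
MU = [0.3, -0.2]       # made-up latent mean


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def sq_norm(vec):
    """Squared Euclidean norm."""
    return sum(c * c for c in vec)


def expected_recon_loss_mc(mat, x, mu, sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of E ||x - D(mu + sigma * eps)||^2, eps ~ N(0, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]
        total += sq_norm([xi - di for xi, di in zip(x, matvec(mat, z))])
    return total / n_samples


def predicted_recon_loss(mat, x, mu, sigma):
    """Right-hand side of Eq. 8 with T(mu) = tr(A^T A) (assumed definition)."""
    residual = [xi - di for xi, di in zip(x, matvec(mat, mu))]
    trace_term = sum(a * a for row in mat for a in row)
    return sq_norm(residual) + sigma ** 2 * trace_term


if __name__ == "__main__":
    print(expected_recon_loss_mc(A, X, MU, 0.1))
    print(predicted_recon_loss(A, X, MU, 0.1))
```

With these values the closed form gives 0.7125 + 0.01 × 15.25 = 0.865, and the sampled estimate should land close to it. For a nonlinear decoder the match would only hold approximately, for small $\sigma$.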

“$\mathcal{L}_{\text{var}}(\sigma)=\frac{1}{\sigma^{2}+\delta}$”

Section 4.2, Eq. 9
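
A minimal sketch of Eq. 9 and its gradient shows why the term counteracts collapse. The derivative with respect to $\sigma$ is negative for every $\sigma>0$, so gradient descent pushes $\sigma$ upward, and for $\sigma^{2}\gg\delta$ the push strengthens as $\sigma$ shrinks. The constant $\delta$ keeps the loss bounded by $1/\delta$ at $\sigma=0$. The function names and the default `delta` are illustrative assumptions; the review does not state the value of $\delta$ used in the paper.

```python
def ve_loss(sigma: float, delta: float = 1e-6) -> float:
    """Inverse-variance penalty 1 / (sigma^2 + delta), as in Eq. 9."""
    return 1.0 / (sigma ** 2 + delta)


def ve_loss_grad(sigma: float, delta: float = 1e-6) -> float:
    """Analytic derivative of ve_loss with respect to sigma."""
    return -2.0 * sigma / (sigma ** 2 + delta) ** 2


if __name__ == "__main__":
    # Smaller sigma -> larger penalty and a steeper (more negative) gradient.
    for sigma in (1.0, 0.5, 0.1):
        print(sigma, ve_loss(sigma), ve_loss_grad(sigma))
```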

“The marginal distribution $p_{\theta}(z_{0})$ over the encoder's latent space is therefore learned by the diffusion model itself rather than fixed a priori. Consequently, constraining $q_{\phi}(z_{0}\mid x)$ to follow a Gaussian distribution is both unnecessary and potentially harmful”

Appendix A

What holds up

The theoretical derivation of variance collapse via first-order Taylor expansion of the decoder is rigorous and explains the phenomenon better than prior heuristic observations. The toy example (Figure 1) effectively visualizes how compact latent manifolds lead to sampling failures. The equilibrium analysis showing $\sigma = (\lambda/T(\mu))^{1/4}$ (Eq. 14) demonstrates that the method adapts variance to local decoder sensitivity, which is a significant advantage over fixed KL regularization. The comprehensive ablation in Table 2 demonstrating that sweeping KL weights from $10^{-6}$ to $8$ cannot achieve VE's performance (FID 18.90 vs best KL FID 22.87) is convincing evidence that the method is not merely a hyperparameter tuning of the KL term.
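
The adaptivity claim can be illustrated numerically. The sketch below assumes the per-sample objective in $\sigma$ is $\sigma^{2}T(\mu)+\lambda/(\sigma^{2}+\delta)$, with $\lambda$ the VE weight and any further regularization ignored. Its stationary condition is $(\sigma^{2}+\delta)^{2}=\lambda/T(\mu)$, which reduces to Eq. 14 when $\delta=0$. The $T$ values are hypothetical, and a brute-force grid search is included only as a sanity check on the closed form.

```python
def total_sigma_loss(sigma, T, lam, delta=0.0):
    """sigma-dependent part of the objective: sigma^2 * T + lam / (sigma^2 + delta)."""
    return sigma ** 2 * T + lam / (sigma ** 2 + delta)


def equilibrium_sigma(T, lam, delta=0.0):
    """Closed-form minimizer from (sigma^2 + delta)^2 = lam / T, clipped at zero."""
    variance = (lam / T) ** 0.5 - delta
    return max(variance, 0.0) ** 0.5


def grid_argmin_sigma(T, lam, delta=0.0, lo=1e-3, hi=3.0, steps=30000):
    """Brute-force minimizer over a uniform grid on [lo, hi]."""
    best_sigma = lo
    best_loss = total_sigma_loss(lo, T, lam, delta)
    for i in range(1, steps + 1):
        sigma = lo + (hi - lo) * i / steps
        loss = total_sigma_loss(sigma, T, lam, delta)
        if loss < best_loss:
            best_sigma, best_loss = sigma, loss
    return best_sigma


if __name__ == "__main__":
    lam = 0.1  # the VE weight reported in Section 5.1
    for T in (0.1, 1.6, 25.6):  # hypothetical decoder sensitivities
        print(T, equilibrium_sigma(T, lam), grid_argmin_sigma(T, lam))
```

A 16-fold increase in $T(\mu)$ halves the equilibrium $\sigma$, so sensitive regions of the decoder receive less noise, which a single global KL weight does not provide.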

“$\frac{\partial\mathcal{L}_{\text{rec}}}{\partial\sigma}\approx 2\sigma T(\mu)$”

Section 4.1, Eq. 12-13

“$\sigma^{4}=\frac{\lambda}{T(\mu)}\quad\Longrightarrow\quad\sigma=\left(\frac{\lambda}{T(\mu)}\right)^{1/4}$”

Section 4.2, Eq. 14
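
The step from Eq. 8 and Eq. 9 to Eq. 14 can be filled in as follows. This is a reconstruction under two assumptions not spelled out in the review: that $\lambda$ is the weight on $\mathcal{L}_{\text{var}}$, and that the additional regularization term (weighted by $\lambda_2$) is ignored.

$$
\begin{aligned}
\mathcal{L}_{\text{total}}(\mu,\sigma) &\approx \|\mathbf{X}_{0}-\mathcal{D}(\mu)\|^{2}+\sigma^{2}T(\mu)+\frac{\lambda}{\sigma^{2}+\delta},\\
\frac{\partial\mathcal{L}_{\text{total}}}{\partial\sigma} &\approx 2\sigma T(\mu)-\frac{2\lambda\sigma}{(\sigma^{2}+\delta)^{2}}=0,\\
(\sigma^{2}+\delta)^{2} &= \frac{\lambda}{T(\mu)}.
\end{aligned}
$$

For $\delta\ll\sigma^{2}$ the last line reduces to $\sigma^{4}\approx\lambda/T(\mu)$, which is Eq. 14.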

“++ VE Loss | 0.06 | 0.46 | 26.31 | 18.90”

Table 2

Main concerns

The method introduces three additional hyperparameters ($\lambda_1, \lambda_2, \tau$) with values (0.1, $10^{-6}$, 1) justified only as 'empirically set' without sensitivity analysis, which raises concerns about transferability to other datasets or resolutions. The paper claims VE loss maintains 'strong reconstruction fidelity,' but the evidence is mixed. In Table 4 the method achieves PSNR 28.31 against 27.71 for VA-VAE, which is slightly better rather than worse. In Table 2, however, VE achieves PSNR 26.31 while the KL-regularized baselines range from 26.16 to 27.12, so the reconstruction-preserving claim depends on the setting. The comparison with concurrent RAE work feels defensive; while the authors acknowledge RAE uses a '$\sigma$-VAE-like' approach, they understate that RAE achieved similar FID (1.41) without requiring tokenizer retraining. The 'adversarial interplay' description is hyperbolic: the losses are linearly combined, not adversarial.

“++ VE Loss | 0.06 | 0.46 | 26.31 | 18.90”

Table 2

“++ VE Loss | ++ 10 | 0.26 | 28.31 | 0.090 | 0.792”

Table 4

“Empirically, we set the variance expansion loss weight $\lambda_{1}$, the regularization loss weight $\lambda_{2}$, and the threshold-like parameter $\tau$ to 0.1, $1\times 10^{-6}$, and 1, respectively”

Section 5.1

Evidence and comparison

The evidence strongly supports the core claim that VE loss improves diffusion sampling quality. Table 1 demonstrates consistent FID improvements across both vanilla LDM ($\Delta$FID -2.55 on DiT-B) and foundation-model-aligned VAVAE ($\Delta$FID -4.43 on LightningDiT-B). The comparison to prior work is generally fair—Table 3 includes comprehensive SOTA baselines (REPA, REG, RAE, MAR, VAR) and shows their method achieves the best FID (1.18) with fewer training epochs (530 vs 800 for most competitors). However, they omit direct comparison with the '$\sigma$-VAE' baseline from Sun et al. in the main experiments, only mentioning it in related work. The claim that 'tuning the KL regularization coefficient alone cannot fundamentally resolve the problem' is well-supported by Table 2, which shows FID degrades as KL weight increases beyond $10^{-2}$ despite improving variance.

“VAVAE+VE loss | 16 | 0.45 | 26.54 | 0.118 | 0.74 | 19.42 | 15.50 | 12.89”

Table 1

“Ours | 530 | 675M | 1.18 | 4.29 | 289.8 | 0.78 | 0.66”

Table 3

“$\sigma$-VAE adopts a fixed variance design to inject controlled stochasticity into the latent representation”

Section 2.2

Reproducibility

Reproducibility is moderately good but has gaps. The authors provide a GitHub repository (github.com/CVL-UESTC/VE-Loss) and detailed training protocols in Section 5.1, including batch sizes (64 for tokenizer, 1024 for DiT), learning rates ($2.5\times 10^{-5}$ for tokenizer, $2\times 10^{-4}$ for DiT), and optimizer settings (AdamW with $\beta_1=0.5, \beta_2=0.9$). However, the critical switch to the Muon optimizer for long-horizon training of DINOv2-aligned spaces—a key detail for reproducing their best results—is mentioned only in Appendix B without full hyperparameters. The toy experiment details (Appendix C) are thorough, but the exact VAVAE fine-tuning protocol (only 5 epochs vs standard 50/130) lacks specifics on learning rate schedules or whether other hyperparameters were adjusted. The paper mentions 'limited computational resources' for some experiments but does not specify GPU-hours or training time, making cost-based reproducibility assessment difficult.
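
The settings reported in this review can be collected into a small config sketch, which also makes the gaps explicit. All field and variable names are hypothetical. Values are only those stated in this review as coming from Section 5.1, and anything the review does not pin down is left as `None` rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class StageConfig:
    """Training settings for one stage; None marks a value not given in the review."""
    global_batch_size: int
    learning_rate: float
    optimizer: Optional[str] = None
    betas: Optional[Tuple[float, float]] = None


def unspecified(cfg: StageConfig) -> List[str]:
    """Names of the fields that are still None, i.e. reproducibility gaps."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


# Tokenizer stage: batch size, learning rate, and AdamW betas as reported.
TOKENIZER = StageConfig(64, 2.5e-5, "AdamW", (0.5, 0.9))

# DiT stage: the review gives batch size and learning rate; it does not say
# whether the AdamW betas above also apply here, so they are left unset.
DIT = StageConfig(1024, 2e-4)

# VE loss weights and threshold-like parameter as reported.
VE_LOSS_WEIGHTS = {"lambda_1": 0.1, "lambda_2": 1e-6, "tau": 1.0}

if __name__ == "__main__":
    print("tokenizer gaps:", unspecified(TOKENIZER))
    print("DiT gaps:", unspecified(DIT))
```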

“Training is performed on eight NVIDIA RTX 4090 GPUs with a global batch size of 64... All models are optimized using the AdamW optimizer with $\beta_{1}=0.5$, $\beta_{2}=0.9$ and a learning rate of $2.5\times 10^{-5}$”

Section 5.1

“we observed that using the Muon optimizer substantially alleviates this issue. Therefore, for all long-horizon training runs in this work, we adopt Muon as our default optimizer”

Appendix B

“For the state-of-art one, due to limited computational resources, we fine-tune VA-VAE for 5 epochs”

Section 5.1

Abstract

Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
