End-to-End Training for Unified Tokenization and Latent Denoising
Traditional latent diffusion models require staged training: first train a VAE tokenizer, freeze it, then train a diffusion model on top. UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.
UNITE presents a compelling solution to the staging problem in latent diffusion, demonstrating that joint optimization of tokenization and generation via weight sharing is feasible and competitive. The stop-gradient mechanism and the seemingly adversarial training dynamics, in which generation improves even as the denoising loss rises, are well analyzed. However, the learned representations show limited discriminative capability (~30% linear probing accuracy), and the method still relies on LPIPS (which requires a pretrained VGG), somewhat weakening the claim of training without external supervision.
The core technical insight, that tokenization and generation are the same latent inference problem under different conditioning regimes, holds up empirically. The weight-shared design outperforms separate encoder-denoiser configurations, with parameter tying yielding "the best overall rFID–gFID trade-off" (Fig. 5). The analysis of the seemingly adversarial training dynamics is particularly insightful: improvements in generation can coincide with increasing denoising loss as the latent space becomes richer. (This describes how the objectives interact; no GAN-style adversarial loss is used.) The extension to QM9 molecular generation (99.37% reconstruction match) convincingly demonstrates applicability to domains lacking pretrained encoders.
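To make the weight-sharing idea concrete, here is a minimal toy sketch, not the paper's architecture. It assumes a single scalar weight `w` shared by both passes, a hypothetical scalar decoder weight `d`, a linear flow interpolation z_t = (1 - t) z + t ε with velocity target ε - z, and a stop-gradient applied to the latent before the denoising pass. All of these names and conventions are illustrative assumptions; the review does not specify where UNITE applies its stop-gradient. Gradients are written by hand so the example needs no autodiff library.

```python
def joint_step(w, d, x, t, eps):
    """One toy joint step: two passes through the same shared weight w.

    Hypothetical scalar model (an assumption, not UNITE's architecture):
      tokenization pass: z = w * x, reconstruction x_hat = d * z
      denoising pass:    v_pred = w * z_t, with z_t built from a constant copy of z
    """
    # Pass 1 (tokenization): infer the latent from the fully observed input.
    z = w * x
    x_hat = d * z
    rec_loss = (x_hat - x) ** 2
    grad_w_rec = 2 * (x_hat - x) * d * x  # d(rec_loss)/dw

    # Stop-gradient (assumed placement): treat z as a constant from here on,
    # so the denoising loss does not backpropagate through the latent itself.
    z_const = z

    # Pass 2 (denoising): the same w predicts the flow velocity from a noisy latent.
    z_t = (1 - t) * z_const + t * eps
    v_pred = w * z_t
    v_target = eps - z_const
    den_loss = (v_pred - v_target) ** 2
    grad_w_den = 2 * (v_pred - v_target) * z_t  # only the direct path through w

    return {
        "rec_loss": rec_loss,
        "den_loss": den_loss,
        "grad_w_rec": grad_w_rec,
        "grad_w_den": grad_w_den,
        "grad_w_total": grad_w_rec + grad_w_den,  # both tasks update the shared w
    }


result = joint_step(w=0.5, d=1.0, x=2.0, t=0.5, eps=1.0)
print(result["grad_w_total"])  # -3.0
```

Both losses contribute a gradient to the same `w`, which is the sense in which shared parameters let tokenization and denoising jointly shape the latent space. In this sketch the stop-gradient only removes the path from the denoising loss back through the latent.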
While generation quality is strong, the modest linear probing accuracy suggests the latents are optimized for generation rather than discrimination, limiting downstream utility without further adaptation. The claim of training "without any external supervision" is partially undermined by the use of LPIPS, which requires a pretrained VGG network (the authors acknowledge this in a footnote). Comparisons to REPA/RAE methods that use DINOv2 supervision are asymmetric: those methods achieve better FID (e.g., 1.42 vs 1.73) but at higher computational cost. The ablation showing UNITE outperforms the concurrent Unified Latents (UL) work requires careful interpretation, as UL's best results reportedly require a second-stage fine-tuning step that UNITE avoids by design.
The evidence supports the claim that single-stage training matches two-stage approaches. Table 1 shows UNITE-L (589M params) achieves FID 1.73, surpassing DiT-XL/2 (724M params, FID 2.27). The ablation studies are rigorous: removing weight sharing degrades rFID from 1.01 to 1.38, and removing stop-gradient weakens representation alignment as measured by CKA (Fig. 6). The compression analysis (entropy measurements) provides mechanistic insight into why sharing works. However, FID improvements over methods like REPA-SiT-XL/2 (1.42) are not claimed; instead, the paper fairly positions UNITE as achieving strong results without the 15× computational overhead of DINOv2 pretraining.
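For readers unfamiliar with CKA, the following is a minimal sketch of the linear variant, written with the standard library only. It assumes features are given as row-major lists with one row per example. Whether the paper uses linear or kernel CKA is not stated in this review, so the choice of variant is an assumption, and the function names are illustrative.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(u * v for u, v in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features.

    Undefined (division by zero) if either input has no variance.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(yc, xc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


features = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 0.5]]
rotated = [[-b, a] for a, b in features]  # 90-degree rotation of every row
print(linear_cka(features, rotated))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so the rotated copy scores approximately 1, while uncorrelated features score near 0.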
Reproducibility is strong: code is available on GitHub, and Appendix D provides detailed architectural configurations (Table 7) and training hyperparameters (Table 8). The evaluation protocol specifies class-balanced sampling, torch-fidelity for FID, and exact ODE solver settings (dopri5). However, training requires substantial compute (6.7×10^20 FLOPs for UNITE-B), and the procedure is complex: each iteration requires 14 flow mini-batches per reconstruction step, which complicates implementation compared to standard latent diffusion training. The stop-gradient and specific noise schedules (σ=0.7, shift α=0.5) appear critical to avoid degenerate solutions.
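The 14:1 ratio can be pictured as a simple step schedule. The sketch below assumes an ordering (one reconstruction step followed by its flow mini-batches) that the review does not confirm; the function and label names are hypothetical, and only the ratio comes from the text above.

```python
from itertools import islice

FLOW_PER_RECON = 14  # ratio quoted in the review


def step_schedule(flow_per_recon=FLOW_PER_RECON):
    """Yield an endless stream of step labels: one 'recon', then N 'flow' steps.

    The ordering within an iteration is an assumption made for illustration.
    """
    while True:
        yield "recon"
        for _ in range(flow_per_recon):
            yield "flow"


def count_steps(num_iterations, flow_per_recon=FLOW_PER_RECON):
    """Count how many steps of each kind occur in the given number of iterations."""
    total = num_iterations * (flow_per_recon + 1)
    steps = list(islice(step_schedule(flow_per_recon), total))
    return {"recon": steps.count("recon"), "flow": steps.count("flow")}


print(count_steps(3))  # {'recon': 3, 'flow': 42}
```

Seen this way, the denoising objective receives 14 times as many mini-batch updates as the reconstruction objective, which is part of why the loop is harder to implement than a standard single-loss diffusion training loop.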
Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near-state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256×256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.