End-to-End Training for Unified Tokenization and Latent Denoising

cs.CV · cs.AI · cs.GR · cs.LG

Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman · Mar 23, 2026

Why it matters

UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.

AI Review

Plain-language introduction

Traditional latent diffusion models are trained in stages: first train a VAE tokenizer, freeze it, then train a diffusion model on top. UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.

Critical review

Verdict

Bottom line

UNITE presents a compelling solution to the staging problem in latent diffusion, demonstrating that joint optimization of tokenization and generation via weight sharing is feasible and competitive. The stop-gradient mechanism and the adversarial dynamics of joint training (a tension between the objectives, not a GAN-style adversarial loss, which UNITE avoids) are well analyzed. However, the learned representations show limited discriminative capability (~30% linear probing accuracy), and the method still relies on LPIPS, which requires a pretrained VGG; this somewhat weakens the claim of training without external supervision.

“linear probing accuracy...remains comparable to that of other generative tokenizers...at around 30%”

UNITE paper · Section 7
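For readers unfamiliar with the metric, linear probing fits only a linear classifier on top of frozen features and reports its test accuracy. The following is a minimal sketch, assuming NumPy and synthetic, well-separated features as a stand-in for frozen latents. The function name `linear_probe_accuracy`, the optimizer settings, and the data are all hypothetical; this does not reproduce the paper's probing protocol or its ~30% figure.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes,
                          lr=0.01, steps=200):
    """Fit softmax regression on frozen features and return test accuracy."""
    n, d = train_x.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[train_y]
    for _ in range(steps):
        logits = train_x @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n          # gradient of mean cross-entropy
        W -= lr * (train_x.T @ grad)         # only the probe is updated;
        b -= lr * grad.sum(axis=0)           # the features stay frozen
    pred = (test_x @ W + b).argmax(axis=1)
    return float((pred == test_y).mean())

# Synthetic stand-in for frozen latents (assumption: 4 Gaussian clusters).
rng = np.random.default_rng(0)
num_classes, dim = 4, 16
centers = 3.0 * rng.normal(size=(num_classes, dim))

def sample(n):
    y = rng.integers(0, num_classes, size=n)
    x = centers[y] + rng.normal(size=(n, dim))
    return x, y

train_x, train_y = sample(400)
test_x, test_y = sample(200)
probe_acc = linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes)
print(f"linear probe accuracy on synthetic features: {probe_acc:.3f}")
```

A low probe accuracy on real latents means class information is not linearly accessible; it does not by itself say the latents are poor for generation.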

What holds up

The core technical insight, that tokenization and generation are the same latent inference problem under different conditioning regimes, holds up empirically. The weight-shared design outperforms separate encoder-denoiser configurations, with parameter tying yielding "the best overall reconstruction–generation trade-off" (rFID vs. gFID, Fig. 5). The analysis of the adversarial dynamics of joint training is particularly insightful: improvements in generation can coincide with increasing denoising loss as the latent space becomes richer. The extension to QM9 molecular generation (99.37% reconstruction match) convincingly demonstrates applicability to domains lacking pretrained encoders.

“improvements in generative fidelity can even coincide with an increase in denoising loss...it often signals that the latent space is becoming richer and more informative”

UNITE paper · Section 3.3

“parameter tying yields the best overall reconstruction–generation trade-off”

UNITE paper · Section 4
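The "two forward passes through the same Generative Encoder" idea can be illustrated with a toy, forward-only sketch. Everything here is an assumption made for illustration: the single weight matrix `shared_W`, the linear decoder, the linear noise interpolation, and the velocity-style target are hypothetical and are not taken from the paper. The review mentions a stop-gradient, but its exact placement is not stated here, so it is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NUM_CLASSES = 8, 10

# One set of weights reused by both passes (toy "Generative Encoder").
shared_W = rng.normal(scale=0.3, size=(DIM, DIM))
class_emb = rng.normal(scale=0.3, size=(NUM_CLASSES, DIM))
decoder_W = rng.normal(scale=0.3, size=(DIM, DIM))  # hypothetical decoder

def generative_encoder(tokens, cond):
    """Shared network: input tokens plus a conditioning vector -> latent-space output."""
    return np.tanh((tokens + cond) @ shared_W)

def training_losses(images, labels):
    batch = images.shape[0]

    # Pass 1 (tokenization): infer latents from fully observed inputs.
    z = generative_encoder(images, cond=np.zeros((batch, DIM)))
    recon_loss = np.mean((z @ decoder_W - images) ** 2)

    # Pass 2 (generation): infer latents from noise plus class conditioning.
    t = rng.uniform(size=(batch, 1))
    noise = rng.normal(size=z.shape)
    z_t = (1.0 - t) * z + t * noise      # assumed linear interpolation
    target = noise - z                   # assumed velocity-style target
    pred = generative_encoder(z_t, cond=class_emb[labels] + t)
    denoise_loss = np.mean((pred - target) ** 2)

    # Both losses depend on shared_W, so a joint update shapes one latent space.
    return float(recon_loss), float(denoise_loss)

images = rng.normal(size=(4, DIM))       # toy "images" already in token form
labels = np.array([0, 1, 2, 3])
recon_loss, denoise_loss = training_losses(images, labels)
print(recon_loss, denoise_loss)
```

The point of the sketch is only structural: a single parameter set receives gradient signal from both objectives, which is what distinguishes this setup from the separate-weights ablation.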

Main concerns

While generation quality is strong, the modest linear probing accuracy suggests the latents are optimized for generation rather than discrimination, limiting downstream utility without further adaptation. The claim of training "without any external supervision" is partially undermined by the use of LPIPS, which requires a pretrained VGG network (the authors acknowledge this in a footnote). Comparisons to REPA/RAE methods that use DINOv2 supervision are asymmetric: those methods achieve better FID (e.g., 1.42 vs 1.73) but at higher computational cost. The ablation showing that UNITE outperforms the concurrent Unified Latents (UL) work requires careful interpretation, as UL's best results reportedly require a second-stage fine-tuning step that UNITE avoids by design.

“Our ImageNet training uses LPIPS loss, which requires a pretrained VGG”

UNITE paper · Footnote 1

“Unlike RAE and REPA, which fundamentally rely on pretrained vision encoders...our single-stage approach reaches comparable performance”

UNITE paper · Section 5.1

Evidence and comparison

The evidence supports the claim that single-stage training matches two-stage approaches. Table 1 shows UNITE-L (589M params) achieves FID 1.73, surpassing DiT-XL/2 (724M params, FID 2.27). The ablation studies are rigorous: removing weight sharing worsens rFID from 1.01 to 1.38 (Table 2), and removing stop-gradient weakens representation alignment as measured by CKA (Fig. 6). The compression analysis (entropy measurements) provides mechanistic insight into why sharing works. The paper does not claim an FID improvement over methods like REPA-SiT-XL/2 (1.42); instead, it fairly positions UNITE as achieving strong results at roughly 15× lower end-to-end cost than methods that rely on pretrained DINOv2 encoders.

“UNITE-L (Ours)...1.73...DiT-XL/2...2.27”

UNITE paper · Table 1

“UNITE-B (Ours)...1.01...w/ separate weights...1.38”

UNITE paper · Table 2

“This is approx. 15× cheaper than the end-to-end cost of methods that rely on pretrained DINOv2 encoders”

UNITE paper · Section 5.3
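Since the alignment ablation is reported in terms of CKA, a short sketch of the standard linear CKA formula may help in reading Fig. 6. Whether the paper uses the linear or a kernel variant is not stated in this review, so the linear form below is an assumption; the function name `linear_cka` and the random test matrices are illustrative only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices with the same number of rows.

    X: (n_samples, d1), Y: (n_samples, d2). Columns are mean-centered first.
    CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))

# A rotated and rescaled copy of X carries the same representation.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
cka_same = linear_cka(X, 2.0 * X @ Q)

# Independent random features should score much lower.
cka_indep = linear_cka(X, rng.normal(size=(200, 8)))
print(cka_same, cka_indep)
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling, which is why it is a common choice for comparing representations from two passes or two networks whose feature bases are not aligned.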

Reproducibility

Reproducibility is strong: code is available on GitHub, and Appendix D provides detailed architectural configurations (Table 7) and training hyperparameters (Table 8). The evaluation protocol specifies class-balanced sampling, torch-fidelity for FID, and exact ODE solver settings (dopri5). However, training requires substantial compute (6.7×10^20 FLOPs for UNITE-B), and the procedure is complex: each reconstruction step is paired with 14 flow mini-batches, which complicates implementation compared to standard latent diffusion training. The stop-gradient and specific noise settings (reconstruction noise σ=0.7, noise schedule shift α=0.5) appear critical to avoid degenerate solutions.

“Code: https://github.com/ShivamDuggal4/UNITE-tokenization-generation”

UNITE paper · Abstract

“UNITE-B requires approximately 6.7×10^20 FLOPs”

UNITE paper · Section 5.3

“Reconstruction Noise (σ)...0.7...Noise Schedule Shift (α)...0.5”

UNITE paper · Table 8
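To put the quoted compute figures in perspective, here is a back-of-the-envelope calculation. It assumes the "approx. 15× cheaper" statement is made relative to the UNITE-B figure (both quotes come from Section 5.3, but the review does not confirm they refer to the same model), and it uses a hypothetical sustained throughput of 2×10^14 FLOP/s per accelerator that is not from the paper.

```python
# Figures quoted in the review (UNITE paper, Section 5.3).
unite_b_flops = 6.7e20   # "UNITE-B requires approximately 6.7×10^20 FLOPs"
cost_ratio = 15.0        # "approx. 15× cheaper" than DINOv2-based pipelines

# Implied end-to-end cost of the DINOv2-based alternative, under the
# assumption that the 15x ratio is measured against UNITE-B.
implied_baseline_flops = unite_b_flops * cost_ratio

def accelerator_days(total_flops, sustained_flops_per_sec):
    """Convert a FLOP budget into single-accelerator days."""
    seconds = total_flops / sustained_flops_per_sec
    return seconds / 86_400

# Hypothetical sustained throughput; real utilization varies widely.
ASSUMED_THROUGHPUT = 2.0e14

unite_days = accelerator_days(unite_b_flops, ASSUMED_THROUGHPUT)
baseline_days = accelerator_days(implied_baseline_flops, ASSUMED_THROUGHPUT)
print(f"implied baseline budget: {implied_baseline_flops:.3e} FLOPs")
print(f"UNITE-B: {unite_days:.1f} accelerator-days, "
      f"baseline: {baseline_days:.1f} accelerator-days")
```

The accelerator-day numbers scale inversely with the assumed throughput, so only the ratio between the two budgets is meaningful.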

Abstract

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near state of the art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single stage joint training of tokenization & generation from scratch is feasible.
