The Universal Normal Embedding
This paper proposes the Universal Normal Embedding (UNE) hypothesis: that generative models and vision encoders, despite different objectives, both approximate noisy linear projections of a shared Gaussian latent space. The authors argue that DDIM-inverted diffusion noise and encoder embeddings (CLIP, DINO) share this approximately Gaussian geometry, enabling linear semantic editing without architectural changes. They introduce NoiseZoo, a dataset of paired latents, to empirically test whether generative noise encodes semantic structure comparable to foundation encoders.
The paper presents an intriguing unification hypothesis backed by solid empirical measurements on CelebA and AFHQ, but the bold "universal" claim outpaces the evidence. The authors convincingly demonstrate that Stable Diffusion noise latents exhibit Gaussianity and linear separability comparable to CLIP/DINO on face attributes, supporting the narrower claim that diffusion models and encoders align on specific domains. However, the leap to a universal shared space across all generative and representational models remains speculative given the limited model diversity (primarily Stable Diffusion variants and CLIP family models) and datasets tested.
The NoiseZoo dataset construction is rigorous, providing per-image latents across multiple diffusion models and encoders with consistent preprocessing. Table 1's Gaussianity tests are compelling: generative models achieve ~96% acceptance rates on normality tests (approaching theoretical Gaussian thresholds), while encoders score ~85-92%. Figure 3 confirms that linear probes on DDIM-inverted noise achieve accuracy highly correlated with CLIP baselines, and the linear editing demonstrations (Figures 4-5) work without fine-tuning or prompt engineering. The shared space recovery via GCCA (Figure 6) shows consistent neighborhood structure across latent intersections.
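To make the "acceptance rate" figures concrete, here is a minimal sketch of how such a number can be computed with random 1D projections, the kind of test the paper reportedly uses in Section 4.1. The specific choices are assumptions for illustration, not the authors' procedure: a Jarque-Bera test, a 0.05 significance level, toy 16-dimensional data, and the hypothetical helper names `jarque_bera_pvalue` and `acceptance_rate`.

```python
import math
import random


def jarque_bera_pvalue(xs):
    """Asymptotic p-value of the Jarque-Bera normality test for a 1D sample."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exactly exp(-x / 2).
    return math.exp(-jb / 2.0)


def acceptance_rate(latents, n_projections=100, alpha=0.05, seed=0):
    """Fraction of random 1D projections for which normality is not rejected."""
    rng = random.Random(seed)
    dim = len(latents[0])
    accepted = 0
    for _ in range(n_projections):
        # The direction is left unnormalized: the JB statistic is scale invariant.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        projected = [sum(d * v for d, v in zip(direction, z)) for z in latents]
        if jarque_bera_pvalue(projected) > alpha:
            accepted += 1
    return accepted / n_projections


# Toy data (assumption): 250 samples in 16 dimensions.
rng = random.Random(0)
n_samples, dim = 250, 16
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]

# A scale mixture: each sample is a Gaussian vector times a random scale,
# so every 1D projection is heavy-tailed.
heavy_tailed = []
for _ in range(n_samples):
    scale = rng.choice([1.0, 5.0])
    heavy_tailed.append([scale * rng.gauss(0.0, 1.0) for _ in range(dim)])

gauss_rate = acceptance_rate(gaussian)
mix_rate = acceptance_rate(heavy_tailed)
print(f"Gaussian latents:     {gauss_rate:.2f}")
print(f"Heavy-tailed latents: {mix_rate:.2f}")
```

At a 0.05 significance level, truly Gaussian data should be accepted on roughly 95% of projections, which is one plausible reading of the "theoretical Gaussian thresholds" that the reported ~96% is compared against. Because projections of many non-Gaussian distributions also look Gaussian in high dimensions, a high acceptance rate is necessary rather than sufficient evidence of joint Gaussianity.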
The scope is too narrow to support "universal" claims: experiments focus almost exclusively on CelebA (human faces) with limited AFHQ validation, and the tested models are primarily Stable Diffusion variants (which share architecture and training data) plus CLIP variants. A critical confound exists because Stable Diffusion uses CLIP for text conditioning during training, which could explain the alignment without any convergent Gaussian geometry. The UNE hypothesis posits an invertible mapping $S \leftrightarrow \mathcal{N}(0,I_D)$ but provides no theoretical mechanism for why this specific Gaussian structure should emerge universally, only empirical correlation. The shared space recovery (Section 3.3) using GCCA is preliminary, and the authors acknowledge that it "may not recover the full UNE." Additionally, DDIM inversion introduces reconstruction artifacts and is not identity-preserving for real images, raising questions about whether the observed semantics derive from the noise or from the inversion process itself.
The evidence supports the specific claim that diffusion noise and encoder embeddings align linearly on face attributes, but comparisons to related work require nuance. The paper correctly cites stitching literature (Badrinath et al., Lähner et al.) showing cross-model linear alignment, but understates that these works typically show alignment within model families (GAN-to-GAN or encoder-to-encoder) rather than across the generative/representational divide. The identifiability results the authors cite (Zimmermann et al., Daunhawer et al.) only establish that contrastive learning recovers latents up to linear or invertible transforms, not that these latents converge to a shared Gaussian space. The Platonic Representation Hypothesis (Huh et al.) is invoked as motivation, but that work focuses on representational convergence rather than explicit Gaussian geometry.
Reproducibility is strong: code and the NoiseZoo dataset are publicly released. Appendix A provides detailed hyperparameters including DDIM inversion settings (50 steps, guidance scale 3.5, seed 42), PCA dimensions (500 for generative, 310 for encoders), and ridge regression regularization scaled by feature energy ($\alpha_{eff}=\alpha \|X_{source}\|_F^2/d$). The linear classifier training uses standard scikit-learn with specified solvers (saga) and iterations. However, the paper does not report confidence intervals or standard deviations for the accuracy metrics in Table 2, and the random 1D projection tests for Gaussianity (Section 4.1) sample only 250 data points per model, which may be undersized for high-dimensional spaces ($d \sim 16000$ for diffusion latents).
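For reference, the energy-scaled regularizer can be read as follows. This is a sketch under assumptions, not the paper's stated formulation: the ridge map $W$ is taken to send source features $X_{source}$ to hypothetical target features $X_{target}$, and $d$ is taken to be the source feature dimension.

$$
W^\star = \arg\min_W \; \|X_{source} W - X_{target}\|_F^2 + \alpha_{eff}\,\|W\|_F^2,
\qquad
\alpha_{eff} = \frac{\alpha\,\|X_{source}\|_F^2}{d}
$$

$$
W^\star = \left(X_{source}^\top X_{source} + \alpha_{eff} I\right)^{-1} X_{source}^\top X_{target}
$$

Under this reading, the scaling makes $\alpha$ independent of feature magnitude: replacing $X_{source}$ by $c\,X_{source}$ multiplies both $X_{source}^\top X_{source}$ and $\alpha_{eff}$ by $c^2$, so $W^\star$ shrinks by $1/c$ and the fitted values $X_{source} W^\star$ are unchanged. That would let a single $\alpha$ be shared between diffusion latents and encoder embeddings whose norms differ widely.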
Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
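The abstract's "simple orthogonalization" can be illustrated with a small sketch. It assumes the edit is a shift of the latent along a linear probe direction after projecting out the directions of attributes that should stay fixed. The function names, the 3-dimensional toy vectors, and the Gram-Schmidt formulation are all illustrative assumptions, not the authors' released code.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def orthogonalize(direction, nuisance_directions):
    """Remove from `direction` its components along the nuisance directions.

    Assumes the nuisance directions are linearly independent.
    """
    basis = []
    for u in nuisance_directions:
        # Gram-Schmidt: build an orthonormal basis of the nuisance subspace.
        for b in basis:
            coeff = dot(u, b)
            u = [x - coeff * y for x, y in zip(u, b)]
        basis.append(normalize(u))
    out = list(direction)
    for b in basis:
        coeff = dot(out, b)
        out = [x - coeff * y for x, y in zip(out, b)]
    return normalize(out)


def edit(latent, direction, strength):
    """Shift a latent along a unit direction by `strength`."""
    return [z + strength * d for z, d in zip(latent, direction)]


# Toy probe directions (hypothetical): "smile" leaks into "gender".
w_smile = [1.0, 0.5, 0.0]
w_gender = [0.0, 1.0, 0.0]

clean_smile = orthogonalize(w_smile, [w_gender])
latent = [0.2, -0.1, 0.3]
edited = edit(latent, clean_smile, strength=2.0)

# The gender probe score is the same before and after the edit.
print(dot(latent, w_gender), dot(edited, w_gender))
```

In this toy case the cleaned direction has no component along the gender probe, so moving along it changes the smile score while leaving the gender score untouched. Whether real attribute directions disentangle this cleanly is an empirical question that the paper's editing figures address.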