The Universal Normal Embedding

cs.CV eess.IV Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa · Mar 23, 2026
Plain-language introduction

This paper proposes the Universal Normal Embedding (UNE) hypothesis: that generative models and vision encoders, despite different objectives, both approximate noisy linear projections of a shared Gaussian latent space. The authors argue that DDIM-inverted diffusion noise and encoder embeddings (CLIP, DINO) share this approximately Gaussian geometry, enabling linear semantic editing without architectural changes. They introduce NoiseZoo, a dataset of paired latents, to empirically test whether generative noise encodes semantic structure comparable to foundation encoders.

Critical review
Verdict
Bottom line

The paper presents an intriguing unification hypothesis backed by solid empirical measurements on CelebA and AFHQ, but the bold "universal" claim outpaces the evidence. The authors convincingly demonstrate that Stable Diffusion noise latents exhibit Gaussianity and linear separability comparable to CLIP/DINO on face attributes, supporting the narrower claim that diffusion models and encoders align on specific domains. However, the leap to a universal shared space across all generative and representational models remains speculative given the limited model diversity (primarily Stable Diffusion variants and CLIP family models) and datasets tested.

“We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections.”
paper · Abstract
What holds up

The NoiseZoo dataset construction is rigorous, providing per-image latents across multiple diffusion models and encoders with consistent preprocessing. Table 1's Gaussianity tests are compelling: generative models reach ~96% acceptance rates on normality tests, close to the theoretical 95% acceptance rate for truly Gaussian samples, while encoders score ~85-92%. Figure 3 shows that linear probes on DDIM-inverted noise achieve accuracy highly correlated with CLIP baselines, and the linear editing demonstrations (Figures 4-5) work without fine-tuning or prompt engineering. The shared-space recovery via generalized canonical correlation analysis (GCCA, Figure 6) shows consistent neighborhood structure across the shared spaces formed by intersecting different sets of latent sources.

“Generative models approach the theoretical 95% acceptance rate of Gaussian samples, encoders remain high, and non-Gaussian references perform substantially worse.”
paper · Table 1
“DDIM-inverted noise latents from SD 1.5, SD 2.1, and LCM achieve accuracy highly correlated with a strong encoder baseline (CLIP-B/16), despite originating from diffusion noise rather than semantic encoders.”
paper · Section 3.1
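
The acceptance-rate idea behind Table 1 can be illustrated with a minimal sketch. It assumes the test is applied to random 1D projections at the 5% significance level; the Jarque-Bera statistic, the toy dimension of 8, and the synthetic data are assumptions chosen for illustration and are not taken from the paper or from NoiseZoo.

```python
import math
import random

# 95th percentile of a chi-square distribution with 2 degrees of freedom,
# the asymptotic reference distribution of the Jarque-Bera statistic.
CHI2_2DF_95 = 5.991


def jarque_bera(xs):
    """Jarque-Bera statistic; large values signal non-Gaussian skew or kurtosis."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)


def random_unit_vector(dim, rng):
    """Uniformly random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def acceptance_rate(latents, n_projections, rng):
    """Fraction of random 1D projections that pass the test at the 5% level."""
    dim = len(latents[0])
    accepted = 0
    for _ in range(n_projections):
        w = random_unit_vector(dim, rng)
        proj = [sum(wi * zi for wi, zi in zip(w, z)) for z in latents]
        if jarque_bera(proj) < CHI2_2DF_95:
            accepted += 1
    return accepted / n_projections


# Synthetic stand-ins (assumption): 250 points, matching the sample size the review cites.
rng = random.Random(0)
n_points, dim = 250, 8
gaussian = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
skewed = [[rng.expovariate(1.0) for _ in range(dim)] for _ in range(n_points)]

gaussian_rate = acceptance_rate(gaussian, 200, rng)
skewed_rate = acceptance_rate(skewed, 200, rng)
print(f"Gaussian reference: {gaussian_rate:.2f}, exponential reference: {skewed_rate:.2f}")
```

For truly Gaussian data the acceptance rate should sit near 95% by construction of the test, which is why that figure serves as the reference point in the quoted passage; a clearly non-Gaussian reference should fall well below it.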
Main concerns

The scope is too narrow to support "universal" claims. Experiments focus almost exclusively on CelebA (human faces) with limited AFHQ validation, and the tested models are primarily Stable Diffusion variants (which share architecture and training data) plus CLIP variants. There is also a critical confound: Stable Diffusion is conditioned on CLIP text embeddings during training, which could explain the alignment without any convergent Gaussian geometry. The UNE hypothesis posits an invertible mapping $S \leftrightarrow \mathcal{N}(0,I_D)$ but offers no theoretical mechanism for why this specific Gaussian structure should emerge universally; the support is empirical correlation only. The shared-space recovery (Section 3.3) using GCCA is preliminary, and the authors acknowledge that it "may not recover the full UNE." Finally, DDIM inversion introduces reconstruction artifacts and is not identity-preserving for real images, which leaves open whether the observed semantics derive from the noise itself or from the inversion process.

“Since each shared space is the intersection of its sources, it cannot contain more information than any single latent space.”
paper · Section 4.4
“For example, DDIM inversion can recover image-specific noise codes for a given diffusion model, but semantic editing in these models typically relies on external guidance.”
paper · Section 1
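
One way to write down the "noisy linear projections" of the hypothesis is sketched here. This is a hedged formalization, not the paper's own notation: the shared latent $u$, the per-model matrix $A_m$, and the isotropic Gaussian noise $\varepsilon_m$ with variance $\sigma_m^2$ are all assumptions introduced for illustration.

$$
u \sim \mathcal{N}(0, I_D), \qquad z_m = A_m u + \varepsilon_m, \qquad \varepsilon_m \sim \mathcal{N}(0, \sigma_m^2 I)
\;\;\Longrightarrow\;\;
z_m \sim \mathcal{N}\!\left(0,\; A_m A_m^\top + \sigma_m^2 I\right).
$$

Under these assumptions each model's latent $z_m$ is itself Gaussian, so a Gaussianity test checks a necessary consequence of the hypothesis. It does not by itself establish that different models share the same $u$, which is the part the concerns above are aimed at.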
Evidence and comparison

The evidence supports the specific claim that diffusion noise and encoder embeddings align linearly on face attributes, but comparisons to related work require nuance. The paper correctly cites stitching literature (Badrinath et al., Lähner et al.) showing cross-model linear alignment, but understates that these works typically show alignment within model families (GAN-to-GAN or encoder-to-encoder) rather than across the generative/representational divide. The identifiability results it cites (Zimmermann et al., Daunhawer et al.) only establish that contrastive learning recovers latents up to linear or invertible transforms, not that these latents converge to a shared Gaussian space. The Platonic Representation Hypothesis (Huh et al.) is invoked as motivation, but that work focuses on representational convergence rather than explicit Gaussian geometry.

“We argue that representations in AI models, particularly deep networks, are converging... We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality.”
Reproducibility

Reproducibility is strong: code and the NoiseZoo dataset are publicly released. Appendix A provides detailed hyperparameters, including DDIM inversion settings (50 steps, guidance scale 3.5, seed 42), PCA dimensions (500 for generative models, 310 for encoders), and a ridge regression penalty scaled by feature energy ($\alpha_{eff}=\alpha \|X_{source}\|_F^2/d$). The linear classifiers are trained with scikit-learn, with the solver (saga) and iteration count specified. However, the paper does not report confidence intervals or standard deviations for the accuracy metrics in Table 2, and the random 1D projection tests for Gaussianity (Section 4.1) sample only 250 data points per model, which may be too few for high-dimensional spaces ($d \sim 16000$ for diffusion latents).

“Inversion was performed with an empty text prompt, classifier-free guidance enabled, a guidance scale of 3.5 and a fixed random seed (42).”
paper · Appendix A.1
“The effective ridge penalty was set to $\alpha_{eff}=\alpha \|X_{source}\|_F^2 / d$, where $\alpha$ is the base regularization parameter.”
paper · Appendix A.2
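
The energy-scaled penalty quoted above can be computed in a few lines. This is a minimal sketch, assuming $d$ denotes the feature dimension (number of columns) of $X_{source}$; the quoted text does not define $d$, and the function name and toy matrix are hypothetical.

```python
def effective_ridge_penalty(x_source, alpha):
    """alpha_eff = alpha * ||X_source||_F^2 / d.

    Assumption: d is the feature dimension, i.e. the number of columns.
    """
    d = len(x_source[0])
    frob_sq = sum(v * v for row in x_source for v in row)  # squared Frobenius norm
    return alpha * frob_sq / d


# Toy 2x2 feature matrix (illustrative values only).
x = [[1.0, 2.0], [3.0, 4.0]]
base = effective_ridge_penalty(x, alpha=0.1)  # 0.1 * (1 + 4 + 9 + 16) / 2

# Rescaling every feature by 10 multiplies the penalty by 10^2.
x_scaled = [[10.0 * v for v in row] for row in x]
scaled = effective_ridge_penalty(x_scaled, alpha=0.1)
print(base, scaled)
```

Because the penalty grows with the square of the feature scale, it tracks $X^\top X$ in the closed-form ridge solution $(X^\top X + \alpha_{eff} I)^{-1} X^\top Y$. The relative strength of regularization therefore stays the same when latents from different models have very different magnitudes, which is a plausible reason for the scaling.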
Abstract

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
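
The orthogonalized linear edit mentioned in the abstract can be sketched as follows. This assumes the orthogonalization is a Gram-Schmidt projection that removes the component of the target direction lying along each confounding direction; the abstract does not spell out the procedure, and the 3-dimensional vectors and attribute names are hypothetical placeholders rather than learned probe weights.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(direction, confounds):
    """Remove the components of `direction` along the span of `confounds`.

    Returns a unit vector. Raises ZeroDivisionError if `direction` lies
    entirely inside the span of the confounds.
    """
    # Build an orthonormal basis for the confound directions first.
    basis = []
    for c in confounds:
        r = list(c)
        for b in basis:
            coef = dot(r, b)
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:
            basis.append([ri / norm for ri in r])
    # Project the target direction onto the orthogonal complement.
    out = list(direction)
    for b in basis:
        coef = dot(out, b)
        out = [oi - coef * bi for oi, bi in zip(out, b)]
    norm = math.sqrt(dot(out, out))
    return [oi / norm for oi in out]


def edit(latent, direction, strength):
    """Linear edit: move the latent along a direction by `strength`."""
    return [z + strength * d for z, d in zip(latent, direction)]


# Hypothetical attribute directions that are partially entangled.
smile = [1.0, 1.0, 0.0]
gender = [0.0, 1.0, 0.0]
clean_smile = orthogonalize(smile, [gender])

z = [0.2, -0.5, 0.1]
z_edit = edit(z, clean_smile, strength=2.0)
print(clean_smile, z_edit)
```

After orthogonalization the edit leaves the latent's projection onto the confounding direction unchanged, which is the sense in which such a step could reduce spurious entanglement between attributes.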
