Repurposing Geometric Foundation Models for Multi-view Diffusion

cs.CV Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu · Mar 23, 2026

Plain-language introduction

This paper proposes Geometric Latent Diffusion (GLD), a framework for novel view synthesis (NVS) that repurposes the feature space of geometric foundation models (specifically Depth Anything 3, DA3) as the latent space for multi-view diffusion. Unlike conventional approaches that operate in view-independent VAE latent spaces, GLD leverages geometrically consistent features that natively encode cross-view correspondences. This enables both high-fidelity RGB reconstruction and zero-shot geometry decoding, and it accelerates training convergence by 4.4× compared to standard VAE spaces.

Critical review

Verdict

Bottom line

The paper presents a compelling and well-validated case for using geometric foundation model features as latent spaces for NVS. The identification of an optimal boundary layer (k=1) that balances photometric fidelity and geometric correspondence is methodologically rigorous, and the quantitative improvements over VAE and DINO baselines, particularly in 3D consistency metrics, are substantial. However, the claim of training "from scratch" requires qualification: only the diffusion model is trained from scratch, while the method relies heavily on pretrained geometric foundation models (DA3/VGGT) that encapsulate significant 3D priors. This complicates direct comparisons with methods fine-tuned from text-to-image diffusion models.

“synthesizing up to level 1 achieves superior NVS performance”

Boundary Layer Evaluation · Section 4.3.2

“GLD consistently outperforms both baselines in PSNR, SSIM, and LPIPS across all benchmarks”

Table 3 · Section 5.2

What holds up

The cascaded generation scheme and systematic boundary layer analysis (Table 2) demonstrate careful architectural design, with level 1 features offering the best trade-off between reconstruction quality (PSNR 25.36) and geometric correspondence (PCK 35.98). The RGB decoder results confirm that DA3 features support high-fidelity reconstruction (PSNR 35.41, SSIM 0.960, LPIPS 0.019), which supports their suitability as a generative latent space. Most importantly, the improvements in 3D consistency metrics, specifically up to a $2.8\times$ reduction in ATE and a $2.6\times$ reduction in RPE compared to the baselines, provide strong evidence that geometric latent spaces inherently mitigate pose drift and multi-view inconsistency.

“PSNR 35.41, SSIM 0.960, LPIPS 0.019”

Table 1 · Section 4.2

“GLD achieves up to a 2.8× lower ATE and a 2.6× lower RPE compared to the baselines”

Table 3 · Section 5.2.2
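To put the PSNR figures quoted above in perspective, the following is a minimal sketch of the standard PSNR definition, $\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$, and its inverse. The function names and the assumption that pixel values lie in $[0, 1]$ are illustrative and are not taken from the paper. The two PSNR values come from different experiments (decoder reconstruction versus the level 1 boundary setting), so the ratio below only illustrates what a roughly 10 dB gap means in terms of squared error.

```python
import math


def psnr_from_mse(mse: float, max_value: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak signal value."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db: float, max_value: float = 1.0) -> float:
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))


# PSNR values quoted in the review (assumed peak value of 1.0).
decoder_mse = mse_from_psnr(35.41)  # RGB decoder reconstruction
level1_mse = mse_from_psnr(25.36)   # level 1 boundary setting

# Ratio of squared errors; independent of the assumed peak value.
mse_ratio = level1_mse / decoder_mse
print(f"MSE ratio for a {35.41 - 25.36:.2f} dB gap: {mse_ratio:.2f}x")
```

Because PSNR is logarithmic, every 10 dB corresponds to a factor of 10 in mean squared error, which is why small-looking dB differences between methods can reflect sizeable changes in pixel error.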

Main concerns

The primary practical limitation is inference speed: GLD requires 66.1 seconds per scene compared to 28.0 seconds for VAE (Table 15), roughly 2.4× slower, due to the cascaded two-stage sampling process. Comparisons with state-of-the-art methods like MVGenMaster are complicated by fundamental architectural differences: MVGenMaster uses explicit depth warping (which introduces distinct failure modes), while GLD embeds geometry implicitly through the latent space. Additionally, zero-shot generalization is evaluated on only a single out-of-domain dataset (Mip-NeRF 360), and the reliance on DA3's pretrained geometric priors means the "from scratch" claim obscures significant transferred knowledge from larger pretraining datasets. The paper also omits discussion of failure cases beyond severe occlusion and very sparse spatial coverage.

“VAE 28.0 (s), GLD (ours) 66.1 (s)”

Inference Latency · Appendix D.2, Table 15

“In cases of severe occlusion or very sparse spatial coverage, the model may hallucinate content”

Limitations · Appendix D.3

Evidence and comparison

The evidence strongly supports the claims relative to VAE and DINO baselines, with clear margins across PSNR, SSIM, LPIPS, and 3D consistency metrics (Table 3). However, comparisons to CAT3D and Matrix3D are less controlled: these methods use different architectures and are fine-tuned from large text-to-image models, while GLD uses a custom DiT$^{\text{DH}}$ architecture. The zero-shot geometry decoding claim is well supported by the ETH3D depth evaluation (Table 9), where GLD outperforms Matrix3D (AbsRel 0.160 vs 0.197), and the 4.4× training speedup is documented in Figure 1(c). The geometric correspondence analysis (Appendix D.1), which shows that DA3 latents produce stronger cross-view attention correlations, provides mechanistic insight supporting the quantitative results.

“GLD (Ours) AbsRel 0.160, Matrix3D 0.197”

Table 9 · Section 5.7.1

“GLD converges 4.4× faster than VAE”

Figure 1 · Introduction
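For readers unfamiliar with the AbsRel depth metric cited above, here is a minimal sketch of its usual definition: the mean of $|d_{\text{pred}} - d_{\text{gt}}| / d_{\text{gt}}$ over pixels with valid ground truth. The function name and the toy depth values are hypothetical and are not data from the paper. Depth benchmarks often align predicted scale (and sometimes shift) before scoring; whether the paper does so is not stated in this review, so no alignment is applied here.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with positive ground-truth depth."""
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depths")
    return sum(errors) / len(errors)


# Toy depths in metres (made up for illustration; not from the paper).
gt_depth = [1.0, 2.0, 4.0, 0.0]    # 0.0 marks an invalid pixel and is skipped
pred_depth = [1.1, 1.8, 4.4, 3.0]  # each valid pixel is off by 10%
toy_score = abs_rel(pred_depth, gt_depth)

# Relative reduction implied by the quoted AbsRel values (lower is better).
gld_abs_rel, matrix3d_abs_rel = 0.160, 0.197
relative_reduction = (matrix3d_abs_rel - gld_abs_rel) / matrix3d_abs_rel
print(f"toy AbsRel: {toy_score:.3f}, relative reduction: {relative_reduction:.1%}")
```

On the quoted numbers, the gap between 0.197 and 0.160 amounts to a relative error reduction of a little under 19%.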

Reproducibility

The paper provides detailed architectural specifications (DiT$^{\text{DH}}$ with a 28-block condition encoder of hidden dimension 768 and a velocity decoder of hidden dimension 2048), hyperparameters (AdamW, lr $5\times 10^{-5}$, batch size 48, 175k iterations on 8 B200 GPUs), and dataset mixtures (Re10K:DL3DV:HyperSim:TartanAir at 4:4:1:1). The RGB decoder architecture and normalization statistics are clearly documented. However, while a project page URL is referenced, no explicit code repository link or checkpoint release is provided in the text, potentially limiting immediate reproducibility. The reliance on specific DA3 feature statistics (channel-wise normalization) and the cascaded inference procedure adds implementation complexity that could hinder exact reproduction without official code.

“Condition encoder: 28 blocks, hidden dimension 768; Velocity decoder: hidden dimension 2048”

Architecture Details · Appendix A.2.3

“AdamW with a fixed learning rate of $5\times 10^{-5}$, batch size of 48, trained on 8 B200 GPUs for 175k iterations”

Training Details · Appendix A.1
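As a rough aid for anyone attempting a reproduction, the sketch below turns the quoted 4:4:1:1 dataset ratio into sampling probabilities and estimates the total number of training samples. Everything beyond the quoted numbers is an assumption: the helper names are hypothetical, the paper may mix datasets per batch or per epoch rather than per sample, and the review does not say whether the batch size of 48 is global or per GPU.

```python
import random
from collections import Counter

# Mixture ratio quoted in the review: Re10K:DL3DV:HyperSim:TartanAir = 4:4:1:1.
MIXTURE = {"Re10K": 4, "DL3DV": 4, "HyperSim": 1, "TartanAir": 1}


def mixture_probabilities(weights):
    """Normalize integer mixture weights into sampling probabilities."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, k, seed=0):
    """Draw k dataset names independently, proportional to the weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


probs = mixture_probabilities(MIXTURE)
batch_sources = sample_sources(MIXTURE, k=48)  # one batch of the quoted size
counts = Counter(batch_sources)

# Total samples seen, assuming 48 is the global batch size (not confirmed).
total_samples = 175_000 * 48
print(probs)
print(dict(counts))
print(f"total samples under this assumption: {total_samples:,}")
```

Under the global-batch assumption the run covers 8.4 million samples, of which about 40% each would come from Re10K and DL3DV; if 48 is instead a per-GPU figure, the total would be eight times larger.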

Abstract

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
