Repurposing Geometric Foundation Models for Multi-view Diffusion
This paper proposes Geometric Latent Diffusion (GLD), a framework for novel view synthesis (NVS) that repurposes the feature space of geometric foundation models (specifically Depth Anything 3, abbreviated DA3) as the latent space for multi-view diffusion. Unlike conventional approaches that operate in view-independent VAE latent spaces, GLD leverages geometrically consistent features that natively encode cross-view correspondences. This enables both high-fidelity RGB reconstruction and zero-shot geometry decoding, and it accelerates training convergence by more than 4.4× compared to standard VAE spaces.
The paper presents a compelling and well-validated case for using geometric foundation model features as latent spaces for NVS. The identification of an optimal boundary layer (k=1) that balances photometric fidelity and geometric correspondence is methodologically rigorous, and the quantitative improvements over VAE and DINO baselines, particularly in 3D consistency metrics, are substantial. However, the claim of training "from scratch" requires qualification: the method relies heavily on pretrained geometric foundation models (DA3/VGGT) that encapsulate significant 3D priors, which complicates direct comparisons with methods fine-tuned from text-to-image diffusion models.
The cascaded generation scheme and systematic boundary layer analysis (Table 2) demonstrate careful architectural design, with level 1 features offering the best trade-off between reconstruction quality (PSNR 25.36) and geometric correspondence (PCK 35.98). The RGB decoder evaluation shows that DA3 features support high-fidelity reconstruction (PSNR 35.41, LPIPS 0.019), which supports their suitability as a generative latent space. Most importantly, the improvements in 3D consistency metrics, specifically the $2.8\times$ reduction in ATE and $2.6\times$ reduction in RPE compared to VAE baselines, provide strong evidence that geometric latent spaces mitigate pose drift and multi-view inconsistency.
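For readers less familiar with the two metrics quoted for the boundary layer analysis, the following is a minimal sketch of PSNR and PCK as they are commonly defined. It is not the paper's evaluation code: the function names, the `[0, 1]` pixel range, and the pixel threshold passed to `pck` are assumptions made for illustration, and the review does not state which PCK threshold the paper uses.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    Assumes pixel values lie in [0, max_value]; the range is an assumption here.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def pck(pred_points, gt_points, threshold):
    """Percentage of predicted 2D keypoints within `threshold` of ground truth.

    `threshold` is a hypothetical parameter in pixel units.
    """
    if len(pred_points) != len(gt_points) or not pred_points:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred_points, gt_points)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return 100.0 * correct / len(pred_points)


if __name__ == "__main__":
    print(psnr([0.0, 0.5], [0.0, 0.4]))
    print(pck([(0, 0), (10, 10)], [(0, 1), (20, 10)], threshold=2.0))
```

Higher is better for both quantities. The trade-off described above is that layers with higher PSNR tend to have lower PCK, and vice versa.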
The primary practical limitation is inference speed: GLD requires 66.1 seconds per scene compared to 28.0 seconds for VAE (Table 15), roughly 2.4× slower, due to the cascaded two-stage sampling process. Comparisons with state-of-the-art methods like MVGenMaster are complicated by fundamental architectural differences: MVGenMaster uses explicit depth warping (which introduces distinct failure modes), while GLD embeds geometry implicitly through the latent space. Additionally, zero-shot generalization is demonstrated on only a single out-of-domain dataset (Mip-NeRF 360), and the reliance on DA3's pretrained geometric priors means the "from scratch" claim obscures significant knowledge transferred from larger pretraining datasets. The paper also omits discussion of failure cases beyond severe occlusion.
The evidence strongly supports claims relative to VAE and DINO baselines, with clear margins across PSNR, SSIM, and 3D consistency metrics (Table 3). However, comparisons to CAT3D and Matrix3D are less controlled: these methods use different architectures and are fine-tuned from large text-to-image models, while GLD uses a custom DiT$^{\text{DH}}$ architecture. The zero-shot geometry decoding claim is well supported by the ETH3D depth evaluation (Table 9), where GLD outperforms Matrix3D (AbsRel 0.160 vs 0.197), and the 4.4× training speedup is documented in Figure 1(c). The geometric correspondence analysis (Appendix D.1), which shows that DA3 latents produce stronger cross-view attention correlations, provides mechanistic insight supporting the quantitative results.
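The AbsRel figures cited for the ETH3D comparison can be read against the standard definition of absolute relative depth error, sketched below. This is an illustrative implementation, not the paper's evaluation script: the function name and the choice to mask out pixels with non-positive ground-truth depth are assumptions, and any scale or shift alignment the paper may apply before scoring is not modeled.

```python
def abs_rel(pred_depth, gt_depth):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth.

    Masking gt <= 0 is an assumed convention for invalid pixels.
    """
    valid = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth depth values")
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


if __name__ == "__main__":
    # Toy depth values; the third pixel is skipped because its ground truth is 0.
    print(abs_rel([1.1, 2.0, 5.0], [1.0, 2.0, 0.0]))
```

Lower is better, so 0.160 versus 0.197 means the depth decoded from GLD's latents deviates less, on average and relative to true depth, than Matrix3D's.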
The paper provides detailed architectural specifications (DiT$^{\text{DH}}$ with 28 encoder blocks, hidden dimensions 768/2048), hyperparameters (AdamW, lr $5\times 10^{-5}$, 175k iterations on 8 B200 GPUs), and dataset mixtures (Re10K:DL3DV:HyperSim:TartanAir at 4:4:1:1). The RGB decoder architecture and normalization statistics are clearly documented. However, while a project page URL is referenced, no explicit code repository link or checkpoint release is provided in the text, potentially limiting immediate reproducibility. The reliance on specific DA3 feature statistics (channel-wise normalization) and the cascaded inference procedure add implementation complexity that could hinder exact reproduction without official code.
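Two of the reproduction details mentioned above, the 4:4:1:1 dataset mixture and channel-wise feature normalization, can be sketched as follows. This is a hedged illustration only. The helper names are hypothetical, sampling the source dataset independently per training example is an assumption (the review does not say whether mixing happens per sample or per batch), and the mean and standard deviation values in the usage example are made up, not DA3's actual statistics.

```python
import random

# Mixture ratio as reported in the review: Re10K:DL3DV:HyperSim:TartanAir = 4:4:1:1.
MIXTURE_RATIO = {"Re10K": 4, "DL3DV": 4, "HyperSim": 1, "TartanAir": 1}


def mixture_probabilities(ratio):
    """Convert integer mixture ratios into sampling probabilities."""
    total = sum(ratio.values())
    return {name: weight / total for name, weight in ratio.items()}


def sample_datasets(ratio, num_samples, seed=0):
    """Draw a source dataset name for each training example (assumed per-sample mixing)."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


def normalize_channels(features, mean, std, eps=1e-6):
    """Standardize each channel of each feature token with fixed per-channel statistics.

    `features` is a list of tokens, each a list with one float per channel.
    """
    return [
        [(value - m) / (s + eps) for value, m, s in zip(token, mean, std)]
        for token in features
    ]


if __name__ == "__main__":
    print(mixture_probabilities(MIXTURE_RATIO))
    print(sample_datasets(MIXTURE_RATIO, num_samples=5))
    # Placeholder statistics for a two-channel toy example.
    print(normalize_channels([[1.0, 10.0], [3.0, 30.0]], mean=[2.0, 20.0], std=[1.0, 10.0]))
```

Because the diffusion model is trained on normalized features, a reproduction needs the same per-channel statistics at inference time in order to invert the normalization before decoding. This is one reason the review flags the lack of released code and checkpoints.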
While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.