Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting
This paper challenges the monolithic paradigm in pose-free feed-forward 3D Gaussian Splatting (3DGS), where a single network jointly estimates camera poses and synthesizes Gaussians. The authors propose 2Xplat, a modular two-expert framework that decouples geometry estimation (using Depth Anything 3, DA3) from appearance synthesis (using the Multi-view Pyramid Transformer, MVP) via an explicit pose interface. The core claim is that separating these concerns enables efficient training (fewer than 5K iterations) and novel-view synthesis quality competitive with posed methods, contrary to the assumption that unified architectures are optimal.
The paper presents a compelling empirical case that decoupling geometry and appearance via pretrained experts yields strong performance gains over monolithic pose-free approaches. The results on DL3DV and RE10K show PSNR improvements of more than 3 dB in some settings, achieved within a few thousand training iterations. However, the conceptual contribution is incremental: the "two-expert" design is essentially a pipeline integration of existing pretrained models (DA3 and MVP) with lightweight fine-tuning, rather than a fundamentally new architectural insight. The claim that this design is "surprisingly underexplored" is overstated given prior work on staged geometry-appearance pipelines.
The quantitative results strongly support the efficacy of the proposed decomposition. On DL3DV with 12 views, 2Xplat achieves a PSNR of 26.015 without ground-truth poses, compared to YoNoSplat's 20.383. That is a gap of about 5.6 dB, and a large margin persists across different view counts and resolutions. The training efficiency claim is well substantiated: the authors contrast their 2K–5K iterations on 8 GPUs against YoNoSplat's 150K iterations on 16 GH200 GPUs. The ablation studies (Table 7) demonstrate that relative pose supervision effectively balances rendering quality and pose accuracy, and the cross-dataset generalization to ScanNet++ (Table 5) validates robustness beyond the training distribution.
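To put the quoted PSNR numbers in perspective, the short sketch below converts a PSNR gap into the corresponding ratio of mean squared errors, using the standard definition PSNR = 10 log10(MAX^2 / MSE). The function names are illustrative and not taken from the paper, and the conversion is exact only for a single image; for dataset-averaged PSNR it is a rough guide.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db):
    """Factor by which MSE shrinks when PSNR rises by gap_db decibels."""
    return 10.0 ** (gap_db / 10.0)


# PSNR values quoted in the review (DL3DV, 12 views, no ground-truth poses).
psnr_2xplat = 26.015
psnr_yonosplat = 20.383

gap_db = psnr_2xplat - psnr_yonosplat
ratio = mse_ratio_from_psnr_gap(gap_db)
print(f"gap = {gap_db:.3f} dB, MSE ratio = {ratio:.2f}x")
```

Under this reading, the reported gap corresponds to a pixel-wise squared error that is between three and four times lower for 2Xplat.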
The primary limitation is the heavy reliance on existing, large-scale pretrained experts. The contribution resides largely in the integration strategy, which connects DA3 (geometry) to MVP (appearance), rather than in architectural innovation. This raises the question of whether the performance gains stem from the decoupled design itself or simply from leveraging higher-capacity pretrained models than the baselines use. Second, the method uses Evaluation-Time Pose Alignment (EPA) for reporting, which adds iterative optimization (100 iterations of Adam) during inference. While the authors report results both with and without EPA, the headline comparisons often include this post-processing, which complicates the claim of pure feed-forward superiority. Third, inference cost roughly doubles relative to monolithic approaches, since two large networks must run sequentially. Finally, the authors acknowledge in the limitations section that pose estimation accuracy remains slightly inferior to dedicated geometry methods, indicating that the geometry expert does not fully benefit from the joint training.
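To make the EPA concern concrete, here is a toy sketch of what evaluation-time alignment amounts to: a pose parameter is refined for 100 Adam steps against a target image while the scene stays frozen. Only the "100 iterations of Adam" detail comes from the review. The 1D "renderer", the squared photometric loss, the learning rate, and all names are assumptions made for illustration, not the paper's procedure.

```python
import math

# Toy stand-in for a frozen scene: a 1D Gaussian blob "rendered" at offset t.
PIXELS = [-3.0 + 0.1 * i for i in range(61)]


def render(t):
    """Render the frozen toy scene for a camera offset t (hypothetical)."""
    return [math.exp(-0.5 * (x - t) ** 2) for x in PIXELS]


def loss_and_grad(t, target):
    """Squared photometric error and its analytic derivative w.r.t. t."""
    loss, grad = 0.0, 0.0
    for x, tgt in zip(PIXELS, target):
        r = math.exp(-0.5 * (x - t) ** 2)
        loss += (r - tgt) ** 2
        grad += 2.0 * (r - tgt) * r * (x - t)  # dr/dt = r * (x - t)
    return loss, grad


def align_pose(t_init, target, steps=100, lr=0.05,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on the single pose parameter; the scene is never updated."""
    t, m, v = t_init, 0.0, 0.0
    for k in range(1, steps + 1):
        _, g = loss_and_grad(t, target)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** k)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** k)  # bias-corrected second moment
        t -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return t


TRUE_OFFSET = 0.5  # assumed ground-truth pose of the target view
target_image = render(TRUE_OFFSET)
initial_loss, _ = loss_and_grad(0.0, target_image)
aligned = align_pose(0.0, target_image)
final_loss, _ = loss_and_grad(aligned, target_image)
print(f"aligned offset = {aligned:.3f}, "
      f"loss {initial_loss:.4f} -> {final_loss:.6f}")
```

Even in this toy setting the loop needs 100 loss and gradient evaluations per target view, which is why numbers reported with such a step are not strictly single-pass feed-forward results.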
The experimental evidence robustly supports the claim that the two-expert pipeline outperforms monolithic pose-free alternatives. However, the comparison to posed methods (MVP, DepthSplat) is nuanced: 2Xplat with EPA approaches the performance of posed MVP but does not consistently surpass it (e.g., 27.413 PSNR for 2Xplat vs 27.73 for MVP with ground-truth poses on DL3DV with 64 views), suggesting that the information bottleneck of predicted poses still imposes a ceiling. The authors fairly note that prior "geometry-first" approaches existed but argue that their contribution differs by leveraging recent advances in high-capacity appearance models. This distinction is valid but modest: the cited works (Lai et al., Smith et al., Kang et al. SelfSplat) already explored similar sequential pipelines, differing mainly in training paradigm.
Reproducibility is moderately compromised by dependencies on specific pretrained checkpoints and substantial compute requirements. The method requires pretrained weights for both DA3 (geometry expert) and MVP (appearance expert), which must be publicly available for exact reproduction. Training demands 8 H200 GPUs (state-of-the-art hardware) for 2K–5K iterations, which is efficient relative to baselines but still resource-intensive. The authors provide implementation details including loss weights ($\lambda_{perc}=0.5$, $\lambda_{R}=0.1$, $\lambda_{t}=10$), optimizer settings (AdamW with $2\times 10^{-5}$ learning rate), and data splits, which aids reproduction. However, the paper does not mention code availability or release plans, and the reliance on two distinct complex codebases (DA3 and MVP) increases integration friction for independent reproduction.
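For anyone attempting a reproduction, the reported hyperparameters could be collected in a small configuration like the sketch below. The weight values and optimizer settings are the ones quoted in this review. The additive form of the objective (an unweighted rendering term plus weighted perceptual, rotation, and translation terms) and all class and field names are assumptions, since the review does not state the loss formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    perceptual: float = 0.5    # lambda_perc, as reported
    rotation: float = 0.1      # lambda_R, as reported
    translation: float = 10.0  # lambda_t, as reported


@dataclass(frozen=True)
class OptimizerConfig:
    name: str = "AdamW"            # as reported
    learning_rate: float = 2e-5    # as reported


def total_loss(render, perceptual, rotation, translation, w=LossWeights()):
    """ASSUMED objective: render term plus weighted auxiliary terms."""
    return (render
            + w.perceptual * perceptual
            + w.rotation * rotation
            + w.translation * translation)


# With equal raw magnitudes, the translation term dominates the pose
# supervision because its weight is 100x the rotation weight.
example = total_loss(render=1.0, perceptual=1.0, rotation=1.0, translation=1.0)
print(f"total = {example:.2f}")
```

If the assumed form is close to the actual objective, the two orders of magnitude between the rotation and translation weights suggest the raw translation error is much smaller in scale than the rotation error, which is worth checking before reusing these values on data with a different scene scale.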
Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.