Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images

cs.CV Jiatong Xia, Lingqiao Liu · Mar 22, 2026
AI Review
Plain-language introduction

The paper presents a training-free pipeline for reconstructing instance-aware 3D scenes from 10-20 unposed RGB images and rendering novel views using diffusion. It combines MV-DUSt3R for geometry, SAM for 2D segmentation with warping-based cross-view unification, and the See3D diffusion model for inpainting holes in point-cloud projections. The system enables object-level editing by manipulating the point cloud directly, avoiding per-scene optimization.
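
To make the rendering step concrete, here is a minimal sketch of projecting a point cloud into a target pinhole camera with a z-buffer. It also returns a hole mask marking the pixels no point reached, which are the regions a diffusion inpainter such as See3D would be asked to fill. The function name, the camera convention (points already in camera coordinates, +z forward), and the intrinsics parameters are illustrative assumptions, not the paper's implementation.

```python
import math

def project_points(points_cam, colors, fx, fy, cx, cy, width, height):
    """Splat camera-space points into an image using a z-buffer.

    Returns (image, hole_mask): image[v][u] holds a color or None, and
    hole_mask[v][u] is True where no point landed (to be inpainted).
    Hypothetical helper for illustration only.
    """
    depth = [[math.inf] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for (x, y, z), color in zip(points_cam, colors):
        if z <= 0:  # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        # Keep only the nearest point per pixel.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = color
    hole_mask = [[pixel is None for pixel in row] for row in image]
    return image, hole_mask
```

With sparse inputs the hole mask is typically large, which is why the quality of the final render depends so heavily on the inpainting model.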

Critical review
Verdict
Bottom line

The paper proposes a practical integration of existing foundation models (MV-DUSt3R, SAM, See3D) into a cohesive pipeline for sparse-view 3D reconstruction and editing. While the system demonstrates strong empirical results on small-scale benchmarks, the technical novelty is limited: the warping-based anomaly removal (depth consistency thresholding with $\tau = 3/4$) and instance unification (mIoU matching with $\eta = 1/3$) are standard multi-view stereo techniques rather than algorithmic breakthroughs. The rendering quality depends entirely on See3D, making the method an application paper rather than a contribution to core reconstruction or rendering methodology.
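
As a rough illustration of IoU-based instance unification, the sketch below assigns a mask warped from another view to the best-overlapping existing instance when the IoU exceeds $\eta = 1/3$, and otherwise reports no match. Representing masks as sets of pixel coordinates, the greedy best-match rule, and the names `iou` and `match_instance` are assumptions made for this example; the paper's exact procedure is not reproduced here.

```python
ETA = 1 / 3  # IoU threshold quoted in the review

def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0

def match_instance(warped_mask, instance_masks, eta=ETA):
    """Return the id of the best-overlapping instance, or None if no IoU exceeds eta.

    instance_masks: dict mapping instance id -> set of pixels in the target view.
    A None result means the warped mask would start a new instance.
    """
    best_id, best_iou = None, eta
    for inst_id, mask in instance_masks.items():
        score = iou(warped_mask, mask)
        if score > best_iou:
            best_id, best_iou = inst_id, score
    return best_id
```

Because the outcome hinges on a single threshold, a sensitivity sweep over `eta` would be a natural ablation to request.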

What holds up

The end-to-end pipeline is sound, and the experiments show consistent improvements over prior work. The warping-based consistency checks improve geometry quality, as shown by the depth metrics (RMSE 0.209 vs. 0.219 for MV-DUSt3R+ in Table 2). The instance segmentation results (AP 25.0 vs. 20.2 for PE3R in Table 1) indicate that geometric warping outperforms temporal tracking (SAM2) on sparse, non-sequential views. The ablation, in which AP50 drops by 15.0 points (41.8 to 26.8) without warping-based unification (Table 1), confirms the importance of cross-view geometric fusion.

“Ours: 25.0 AP, 41.8 AP50, 25.4 AP75; PE3R: 20.2 AP, 32.6 AP50, 22.3 AP75”
Xia & Liu, Sec. 4.2 · Table 1
“w/o Warping-based unifi.: 17.0 AP, 26.8 AP50, 18.1 AP75”
Xia & Liu, Sec. 4.3 · Table 1, w/o Warping-based unifi.
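
The gaps cited in this section can be recomputed directly from the Table 1 figures quoted above. The snippet below is only an arithmetic check; the dictionary layout and the `gap` helper are illustrative choices.

```python
# Table 1 values as quoted in the review.
table1 = {
    "Ours": {"AP": 25.0, "AP50": 41.8, "AP75": 25.4},
    "PE3R": {"AP": 20.2, "AP50": 32.6, "AP75": 22.3},
    "w/o Warping-based unifi.": {"AP": 17.0, "AP50": 26.8, "AP75": 18.1},
}

def gap(a, b, metric):
    """Difference in points between rows a and b, rounded to one decimal."""
    return round(table1[a][metric] - table1[b][metric], 1)

for metric in ("AP", "AP50", "AP75"):
    print(
        metric,
        "| vs PE3R:", gap("Ours", "PE3R", metric),
        "| vs no unification:", gap("Ours", "w/o Warping-based unifi.", metric),
    )
```

This confirms the 4.8-point AP gain over PE3R and the 15.0-point AP50 drop when warping-based unification is removed.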
Main concerns

The rendering module inherits all limitations of See3D: it is slow (2.5 minutes per view vs real-time for 3DGS) and hallucinates content to fill holes, risking geometric inaccuracy despite photorealism. The evaluation is limited to only 5 scenes for segmentation and 7 for rendering, raising questions about generalization. The comparisons with 3DGS methods are misleading: those methods require posed inputs, whereas the proposed method builds on MV-DUSt3R, which is designed specifically for unposed reconstruction. The thresholds $\tau = 3/4$ and $\eta = 1/3$ are presented without ablation or theoretical justification and appear arbitrary. Finally, the method is not truly training-free: it relies on three separately trained foundation models (MV-DUSt3R, SAM, See3D) and merely avoids per-scene optimization.

“Ours: ... Inference 2.5 mins ... Total 3.5 mins; 3DGS: ... Total 9 mins”
Xia & Liu, Sec. 4.2 · Table 4
“We show overall instance segmentation performance using AP, AP50, and AP75 metrics on the selected scenes”
Xia & Liu, Sec. 4.2 · Table 1 caption
Evidence and comparison

The evidence supports the claim of improved instance segmentation over PE3R (25.0 vs. 20.2 AP), but PE3R uses SAM2 for temporal tracking, which is at a disadvantage relative to geometric warping on sparse, non-sequential views. The novel view synthesis results (PSNR 18.15) beat the sparse-view 3DGS baselines, but this is expected: 3DGS fails completely without pose initialization, which MV-DUSt3R provides. The comparison with LVSM (PSNR 12.92) puts LVSM at a disadvantage, since it is designed for different input conditions. The ablation studies (Fig. 8, 9) effectively demonstrate that anomaly removal and reference masking are necessary for clean results, though they do not isolate the contribution of the diffusion model from that of simple point-cloud projection.

“3DGS: PSNR 15.61; PE3R + Our render.: PSNR 14.68; MV-DUSt3R+ + Our render.: PSNR 17.69; Ours: PSNR 18.15”
Xia & Liu, Sec. 4.2 · Table 3
“methods are generally designed for forward-facing capture settings or relatively simple 360-degree object-centric scenarios, and thus tend to deteriorate significantly when applied to room-scale scenes”
Xia & Liu, Sec. 4.2 · Section on Rendering
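
For readers weighing the PSNR numbers, the sketch below shows the standard PSNR definition and converts a dB gap into the corresponding ratio of mean squared errors. The flat-list image representation and the function names are assumptions for illustration; the paper's evaluation code is not shown in the review.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length, non-empty sequences of pixel intensities."""
    if not reference or len(reference) != len(rendered):
        raise ValueError("images must be non-empty and have the same number of pixels")
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)

def mse_ratio(psnr_low, psnr_high):
    """How many times larger the MSE is at psnr_low than at psnr_high."""
    return 10 ** ((psnr_high - psnr_low) / 10)

# PSNR values quoted from Table 3: 3DGS vs. the proposed method.
print(round(mse_ratio(15.61, 18.15), 2))
```

For the Table 3 values, the 2.54 dB gap between 3DGS (15.61) and the proposed method (18.15) corresponds to an MSE roughly 1.8 times larger for 3DGS. Both values are low in absolute terms, consistent with the difficulty of the sparse-view setting.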
Reproducibility

The method relies on publicly available models (MV-DUSt3R, MobileSAMv2, See3D) with specified hyperparameters ($\tau = 3/4$, $\eta = 1/3$), making reproduction possible if code is released. However, the paper does not mention code availability (only a project page is cited), and the See3D model itself may have specific licensing or availability constraints not discussed. The camera pose alignment step for testing requires ground-truth poses of input images to compute a transformation to the predicted coordinate system, which limits evaluation to datasets with pose annotations and may propagate errors from MV-DUSt3R's pose prediction.

“we compute the transformation between the ground-truth poses of the input images and the corresponding predicted poses from the feed-forward reconstruction network”
Xia & Liu, Sec. 4.1 · Testing novel view poses
“we apply our anomaly point elimination methods with the parameters $\tau$ set to $3/4$ ... we set the threshold $\eta$ to $1/3$”
Xia & Liu, Sec. 4.1 · Experimental details
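
One common way to compute such a transformation is a least-squares similarity fit (Umeyama alignment) between corresponding camera centers. The quoted passage does not say which estimator the authors use, so the sketch below is an assumption, and `umeyama_alignment` is a hypothetical helper name. It requires NumPy.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity transform such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding camera centers, for example
    ground-truth centers (src) and predicted centers (dst).
    Returns scale s, rotation R (3x3), and translation t (3,).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)  # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # flip one axis so R is a rotation, not a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

A held-out test camera center `c` in the ground-truth frame would then map to `s * R @ c + t` in the predicted frame. Any error in the predicted input poses feeds directly into this fit, which is the error-propagation concern raised above.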
Abstract

We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: https://jiatongxia.github.io/TID3R/
