4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video

cs.CV Jae Won Jang, Yeonjin Chang, Wonsik Shin, Juhwan Cho, Nojun Kwak · Mar 23, 2026
Plain-language introduction

4DGS360 addresses the ill-posed challenge of reconstructing dynamic objects from monocular video by tackling a critical failure mode: existing methods rely on 2D-native priors that overfit to visible surfaces and cannot reconstruct occluded regions at extreme viewpoints (>90°). The authors propose AnchorTAP3D, a hybrid 3D tracker that leverages high-confidence 2D track points as spatial-temporal anchors to stabilize long-term tracking and resolve depth ambiguity in occluded areas. Combined with a new iPhone360 benchmark featuring test cameras up to 135° from training views, the method enables coherent 360° 4D reconstruction without diffusion priors.

Critical review
Verdict
Bottom line

The paper presents a technically sound solution to a genuine problem in dynamic view synthesis. The anchor-based tracking approach effectively bridges the gap between robust 2D correspondence and geometric 3D consistency, yielding measurable improvements on extreme novel views. The iPhone360 dataset fills an important evaluation gap, though its small size (6 scenes) limits generalization claims. The work serves as a strong diffusion-free baseline, but its restriction to static appearance and foreground objects constrains practical applicability.

“While our AnchorTAP3D improves significantly upon naive applications of off-the-shelf 2D and 3D tracking models, the overall performance still depends on the capability of pretrained models.”
paper · Section 5
“our model assumes a fixed color for each Gaussian over time so it cannot account for illumination changes in real-world scenes.”
paper · Section 5
What holds up

The core technical contribution, using confident 2D tracks as anchors for 3D trajectory estimation, is well-motivated and effectively addresses drift accumulation in pure 3D trackers. The ablation study demonstrates that removing anchor guidance ('w/o Anchor') leads to catastrophic failure in long sequences, validating the design. The ARAP regularization

$$\mathcal{L}_{\text{arap}} = w_1 \sum_{(i,j)\in\mathcal{N}} \left| \|\mathbf{x}_i^{t} - \mathbf{x}_j^{t}\|_2 - \|\mathbf{x}_i^{t'} - \mathbf{x}_j^{t'}\|_2 \right| + w_2 \sum_{(i,j)\in\mathcal{N}} \left\| (\mathbf{T}_j^{t})^{-1}(\mathbf{x}_i^{t}) - (\mathbf{T}_j^{t'})^{-1}(\mathbf{x}_i^{t'}) \right\|_2$$

penalizes changes in neighbor distances (first term) and changes in each neighbor's position expressed in the local frame of point $j$ (second term) between times $t$ and $t'$. Combined with the proposed initialization, it enables previously ineffective rigidity constraints to function correctly in occluded regions.

“'w/o 3D init' directly unprojects 2D tracks, 'w/o Anchor' leverages naive 3D point tracking model [tapip3d], and Ours uses AnchorTAP3D”
paper · Figure 8 caption
“This rigidity constraint allows temporally coherent propagation of structural information, enabling 360° geometry even under large motion or occlusion.”
paper · Section 3.3
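
To make the two terms of the ARAP loss discussed in this section concrete, here is a minimal pure-Python sketch. It assumes each point carries a rigid transform made of a rotation matrix and the point's own position, so that the inverse transform maps a neighbor into that point's local frame. The function names (`arap_loss`, `to_local`), the edge list, and the default weights are illustrative assumptions, not the authors' implementation.

```python
import math

def sub(a, b):
    """Component-wise difference of two 3D points."""
    return [a[k] - b[k] for k in range(3)]

def norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(c * c for c in v))

def to_local(R, p, x):
    """Inverse rigid transform R^T (x - p): express x in the frame (R, p)."""
    d = sub(x, p)
    return [sum(R[r][c] * d[r] for r in range(3)) for c in range(3)]

def arap_loss(x_t, x_s, R_t, R_s, edges, w1=1.0, w2=1.0):
    """ARAP-style loss between time t (x_t, R_t) and time t' (x_s, R_s).

    x_*: list of 3D positions; R_*: list of 3x3 rotations (nested lists);
    edges: neighbor pairs (i, j). All names here are hypothetical.
    """
    dist_term = 0.0   # change in neighbor distances
    local_term = 0.0  # change in neighbor position within j's local frame
    for i, j in edges:
        dist_term += abs(norm(sub(x_t[i], x_t[j])) - norm(sub(x_s[i], x_s[j])))
        a = to_local(R_t[j], x_t[j], x_t[i])
        b = to_local(R_s[j], x_s[j], x_s[i])
        local_term += norm(sub(a, b))
    return w1 * dist_term + w2 * local_term
```

Under these assumptions, a purely rigid motion of the whole point set gives a loss of zero, while stretching an edge is penalized by both terms.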
Main concerns

The method exhibits several limitations. First, the approach assumes fixed illumination over time, preventing reconstruction of scenes with changing lighting. Second, the evaluation relies heavily on perceptual metrics (CLIP-I, CLIP-T) rather than pixel-level accuracy, with PSNR/SSIM results relegated to supplementary material, potentially obscuring alignment issues. Third, while iPhone360 enables 360° evaluation, it comprises only 6 scenes, raising concerns about statistical significance. Finally, the system depends on a cascade of pretrained models (depth estimation, 2D tracking, camera pose estimation), where errors in any stage propagate to the final geometry.

“Furthermore, our method cannot synthesize extreme viewpoint's background regions if it is invisible in the input video.”
paper · Section 5
“recent work [liang2025himor] has shown that these pixel-level metrics are misaligned with perceptual quality in monocular dynamic 3D reconstruction with view disparity”
paper · Section 4.1
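
The tension between pixel-level and perceptual metrics can be illustrated with a small PSNR sketch. This is a generic textbook definition of PSNR on flat lists of pixel values, not the paper's evaluation code; the function name `psnr` and the `max_val` parameter are assumptions made for the example.

```python
import math

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    img_a, img_b: flat sequences of pixel values in [0, max_val].
    """
    if len(img_a) != len(img_b) or len(img_a) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# A striped row and the same row shifted by one pixel: the content is
# visually the same pattern, yet every pixel disagrees.
row = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
shifted = row[1:] + row[:1]
score = psnr(row, shifted)
```

A one-pixel misalignment drives the score to the worst value this pattern can produce, which is the kind of sensitivity that motivates reporting perceptual metrics alongside PSNR/SSIM rather than instead of them.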
Evidence and comparison

The evidence supports claims of superior 360° reconstruction compared to HiMoR and MoSca on the new iPhone360 dataset, with consistent improvements across CLIP-based metrics and LPIPS. However, the comparison relies on a dataset introduced by the authors themselves, and the metric choice favors perceptual similarity over geometric accuracy. The paper notes that existing methods 'fail to reconstruct regions observed at extremely novel viewpoints (>90°),' but does not provide detailed failure analysis of whether competitors fail due to tracking drift, insufficient regularization, or representation limitations.

“We further present iPhone360, a new benchmark where test cameras are placed up to 135° apart from training views”
paper · Abstract
“existing methods still fail to reconstruct regions observed at extremely novel viewpoints (e.g., >90° from the current view)”
paper · Section 1
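
For readers who want to check how far a test view lies from a training view, the following sketch computes an angular separation. The provided text does not state how the paper measures this angle; the sketch assumes it is the angle subtended at the object center by the two camera positions, and the function name `angular_separation_deg` is hypothetical.

```python
import math

def angular_separation_deg(cam_a, cam_b, target):
    """Angle in degrees between two camera positions as seen from target.

    cam_a, cam_b, target: 3D points given as (x, y, z) sequences.
    """
    va = [cam_a[k] - target[k] for k in range(3)]
    vb = [cam_b[k] - target[k] for k in range(3)]
    na = math.sqrt(sum(c * c for c in va))
    nb = math.sqrt(sum(c * c for c in vb))
    if na == 0 or nb == 0:
        raise ValueError("camera position coincides with the target")
    cos_angle = sum(va[k] * vb[k] for k in range(3)) / (na * nb)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))

# Training camera on the +x axis, test camera swung around the object.
angle = angular_separation_deg((1.0, 0.0, 0.0), (-1.0, 1.0, 0.0), (0.0, 0.0, 0.0))
```

With this definition, views more than 90° apart see mostly disjoint sides of the object, which is the regime the review's comparison focuses on.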
Reproducibility

The paper mentions a project page (https://jaewon040.github.io/4dgs360/) but does not explicitly confirm code release in the provided text. Reproduction requires multiple pretrained components: depth maps, 2D tracking (BootsTAP), and camera parameters. Hyperparameters for optimization (window sizes $L=16$, loss weights $\lambda_{rgb}, \lambda_{arap}$, etc.) are referenced in the main text but specified only in the supplementary material. The iPhone360 dataset is newly introduced and its availability is not confirmed, which blocks independent validation of the benchmark claims. The method requires per-scene optimization (similar to NeRF/3DGS), and training time is not reported.

“During each inference step, we collect reliable 3D anchor points obtained from high-confidence 2D tracks”
paper · Section 3.2
“https://jaewon040.github.io/4dgs360/”
paper · Title page
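
The three inputs listed above (depth maps, 2D tracks, camera parameters) are what a direct unprojection baseline would combine, as in the 'w/o 3D init' ablation described earlier. The sketch below shows standard pinhole unprojection of one tracked pixel; it assumes a simple pinhole model with intrinsics `fx, fy, cx, cy` and a camera-to-world pose `(R_c2w, t_c2w)`, and the function name is hypothetical rather than taken from the paper.

```python
def unproject_track_point(u, v, depth, fx, fy, cx, cy, R_c2w, t_c2w):
    """Lift a 2D track point (u, v) with metric depth to a 3D world point.

    R_c2w: 3x3 camera-to-world rotation (nested lists);
    t_c2w: camera center in world coordinates.
    """
    # Back-project through the pinhole model into camera coordinates.
    x_cam = (u - cx) / fx * depth
    y_cam = (v - cy) / fy * depth
    p_cam = (x_cam, y_cam, depth)
    # Rigidly transform from camera to world coordinates.
    return tuple(
        sum(R_c2w[r][c] * p_cam[c] for c in range(3)) + t_c2w[r]
        for r in range(3)
    )

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = unproject_track_point(
    u=820.0, v=240.0, depth=2.0,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
    R_c2w=identity, t_c2w=(1.0, 0.0, 0.0),
)
```

Because every lifted point inherits errors from both the depth estimate and the camera pose, this makes explicit why errors in any upstream stage propagate into the reconstructed geometry.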
Abstract

We introduce 4DGS360, a diffusion-free framework for 360$^{\circ}$ dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360$^{\circ}$ geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to visible surfaces in each training view. 4DGS360 addresses this challenge through an advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360$^{\circ}$ 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135$^{\circ}$ apart from training views, enabling 360$^{\circ}$ evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.
