4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video
4DGS360 addresses the ill-posed challenge of reconstructing dynamic objects from monocular video by tackling a critical failure mode: existing methods rely on 2D-native priors that overfit to visible surfaces and cannot reconstruct occluded regions at extreme viewpoints (>90°). The authors propose AnchorTAP3D, a hybrid 3D tracker that leverages high-confidence 2D track points as spatial-temporal anchors to stabilize long-term tracking and resolve depth ambiguity in occluded areas. Combined with a new iPhone360 benchmark featuring test cameras up to 135° from training views, the method enables coherent 360° 4D reconstruction without diffusion priors.
The paper presents a technically sound solution to a genuine problem in dynamic view synthesis. The anchor-based tracking approach effectively bridges the gap between robust 2D correspondence and geometric 3D consistency, yielding measurable improvements on extreme novel views. The iPhone360 dataset fills an important evaluation gap, though its small size (6 scenes) limits generalization claims. The work serves as a strong diffusion-free baseline, though the restriction to static appearance and foreground objects constrains practical applicability.
The core technical contribution, using confident 2D tracks as anchors for 3D trajectory estimation, is well-motivated and effectively addresses drift accumulation in pure 3D trackers. The ablation study demonstrates that removing anchor guidance ('w/o Anchor') leads to catastrophic failure in long sequences, validating the design. The ARAP regularization $\mathcal{L}_{\text{arap}} = w_1 \sum_{(i,j)\in\mathcal{N}} \left| \|\mathbf{x}_i^t - \mathbf{x}_j^t\|_2 - \|\mathbf{x}_i^{t'} - \mathbf{x}_j^{t'}\|_2 \right| + w_2 \sum_{(i,j)\in\mathcal{N}} \|(\mathbf{T}_j^{t})^{-1}(\mathbf{x}_i^t) - (\mathbf{T}_j^{t'})^{-1}(\mathbf{x}_i^{t'})\|_2$, combined with the proposed initialization, enables previously ineffective rigidity constraints to function correctly in occluded regions.
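To make the two ARAP terms concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each $\mathbf{T}_j^t$ is a rigid transform with rotation `R[j]` and translation `x[j]`, so its inverse maps a point `p` to `R[j].T @ (p - x[j])`. The function name `arap_loss`, the argument layout, and the default weights are hypothetical choices made for illustration.

```python
import numpy as np


def arap_loss(x_t, x_tp, R_t, R_tp, pairs, w1=1.0, w2=1.0):
    """Illustrative ARAP loss between two time steps t and t'.

    x_t, x_tp : (N, 3) point positions at t and t'.
    R_t, R_tp : (N, 3, 3) rotation of each point's local frame at t and t'
                (assumption: T_j^t(p) = R_j^t p + x_j^t).
    pairs     : (M, 2) integer array of neighbor index pairs (i, j).
    """
    i, j = pairs[:, 0], pairs[:, 1]

    # Term 1: neighbor distances should be preserved over time.
    d_t = np.linalg.norm(x_t[i] - x_t[j], axis=1)
    d_tp = np.linalg.norm(x_tp[i] - x_tp[j], axis=1)
    length_term = np.abs(d_t - d_tp).sum()

    # Term 2: point i expressed in neighbor j's local frame should not move.
    # 'mab,ma->mb' computes R^T v for each pair, i.e. the inverse rotation.
    local_t = np.einsum("mab,ma->mb", R_t[j], x_t[i] - x_t[j])
    local_tp = np.einsum("mab,ma->mb", R_tp[j], x_tp[i] - x_tp[j])
    frame_term = np.linalg.norm(local_t - local_tp, axis=1).sum()

    return w1 * length_term + w2 * frame_term
```

Under these assumptions, a globally rigid motion (all points and frames rotated and translated together) yields a loss of zero, while stretching the point set yields a positive loss. This is the behavior a rigidity prior is meant to reward in regions that have no photometric supervision.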
The method exhibits several limitations. First, the approach assumes fixed illumination over time, preventing reconstruction of scenes with changing lighting. Second, the evaluation relies heavily on perceptual metrics (CLIP-I, CLIP-T) rather than pixel-level accuracy, with PSNR/SSIM results relegated to supplementary material, potentially obscuring alignment issues. Third, while iPhone360 enables 360° evaluation, it comprises only 6 scenes, raising concerns about statistical significance. Finally, the system depends on a cascade of pretrained models (depth estimation, 2D tracking, camera pose estimation), where errors in any stage propagate to the final geometry.
The evidence supports claims of superior 360° reconstruction compared to HiMoR and MoSca on the new iPhone360 dataset, with consistent improvements across CLIP-based metrics and LPIPS. However, the comparison relies on a dataset introduced by the authors themselves, and the metric choice favors perceptual similarity over geometric accuracy. The paper notes that existing methods 'fail to reconstruct regions observed at extremely novel viewpoints (>90°),' but does not provide detailed failure analysis of whether competitors fail due to tracking drift, insufficient regularization, or representation limitations.
The paper mentions a project page (https://jaewon040.github.io/4dgs360/) but does not explicitly confirm code release in the provided text. Reproduction requires multiple pretrained components: depth maps, 2D tracking (BootsTAP), and camera parameters. Hyperparameters for optimization (window sizes $L=16$, loss weights $\lambda_{rgb}, \lambda_{arap}$, etc.) are referenced but detailed in supplementary material. The iPhone360 dataset is newly introduced and availability is not confirmed, blocking independent validation of benchmark claims. The method requires per-scene optimization (similar to NeRF/3DGS), with training time not reported.
We introduce 4DGS360, a diffusion-free framework for 360$^{\circ}$ dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360$^{\circ}$ geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to visible surfaces in each training view. 4DGS360 addresses this challenge through an advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360$^{\circ}$ 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135$^{\circ}$ apart from training views, enabling 360$^{\circ}$ evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.