Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
Group3D addresses open-vocabulary 3D object detection from multi-view RGB images by integrating semantic constraints directly into instance construction. Unlike prior work that merges fragments based solely on geometric consistency, it leverages a multimodal large language model to organize scene vocabularies into semantic compatibility groups that gate cross-view fragment association. This prevents irreversible over-merging when geometric evidence is incomplete, achieving state-of-the-art results on ScanNet and ARKitScenes in both pose-known and challenging pose-free zero-shot settings.
The paper presents a compelling solution to geometry-driven over-merging in multi-view 3D detection by injecting MLLM-derived semantic priors into the merging process. The gains over Zoo3D (17 mAP points in the pose-free setting on ScanNet20) indicate that semantic gating substantially improves reconstruction-based pipelines. However, the core grouping step depends on the proprietary GPT-5.1, which limits accessibility and reproducibility.
The semantic compatibility grouping mechanism is well-motivated and empirically validated; Table 5 shows that removing category constraints and merging purely by geometry drops mAP25 from 41.2 to 28.2, while strict same-category merging achieves only 35.9. The two-memory architecture cleanly separates semantic aggregation (Scene Vocabulary Memory) from geometric lifting (3D Fragment Memory). The method demonstrates strong zero-shot generalization across benchmarks without any task-specific training or 3D supervision.
The pose-free performance (41.2 mAP25) trails the pose-known performance (51.1 mAP25) by 9.9 points on ScanNet20, indicating that semantic grouping cannot fully compensate for noise in the reconstructed camera poses and depth. The MLLM grouping relies on heuristic prompts that might fail for out-of-distribution categories or for complex part-whole relationships that are not explicitly excluded (e.g., distinguishing "keyboard" from "laptop"). Additionally, the voxel overlap thresholds ($\tau_{iou}=0.01$, $\tau_{cont}=0.10$) appear to be empirically tuned, with no sensitivity analysis in the main text, which raises questions about generalization to scenes with different scales or densities.
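To make the role of the two thresholds concrete, the following is a minimal sketch of a voxel-overlap test between two 3D fragments. It uses the 5 cm voxel size and the $\tau_{iou}$ and $\tau_{cont}$ values quoted in this review; the function names, the definition of containment (intersection over the smaller fragment), and the rule that passing either test is sufficient are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

VOXEL_SIZE = 0.05  # metres (5 cm, as quoted in the review)
TAU_IOU = 0.01     # IoU threshold quoted in the review
TAU_CONT = 0.10    # containment threshold quoted in the review


def voxelize(points, voxel_size=VOXEL_SIZE):
    """Map an (N, 3) array of 3D points to a set of integer voxel indices."""
    idx = np.floor(np.asarray(points, dtype=float) / voxel_size).astype(int)
    return {tuple(int(c) for c in v) for v in idx}


def overlap_scores(vox_a, vox_b):
    """Return (IoU, containment) for two voxel sets.

    ASSUMPTION: containment is intersection over the smaller set;
    the paper may define it differently.
    """
    inter = len(vox_a & vox_b)
    union = len(vox_a | vox_b)
    smaller = min(len(vox_a), len(vox_b))
    iou = inter / union if union else 0.0
    cont = inter / smaller if smaller else 0.0
    return iou, cont


def geometrically_consistent(vox_a, vox_b, tau_iou=TAU_IOU, tau_cont=TAU_CONT):
    """ASSUMPTION: either threshold being met counts as consistent."""
    iou, cont = overlap_scores(vox_a, vox_b)
    return iou >= tau_iou or cont >= tau_cont
```

Because both thresholds are absolute fractions of voxel counts, a sketch like this also shows why they may be sensitive to scene scale: changing the voxel size or the point density changes the voxel sets, and with them the scores being compared against fixed cut-offs.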
Results are comprehensive across ScanNet variants (20/60/200) and ARKitScenes, but the comparison landscape conflates point-cloud methods (with GT geometry) and multi-view RGB methods without controlling for input advantages. While ablations validate the necessity of semantic grouping, the paper lacks comparisons against alternative grouping strategies (e.g., WordNet hierarchies or CLIP embedding similarity) to isolate whether the gains stem from MLLM reasoning or simpler linguistic priors. The qualitative groupings in Table 9 show plausible merges (e.g., [washer, washing machine]), but do not quantify failure modes where the MLLM might incorrectly merge distinct categories.
Full reproduction is blocked by reliance on GPT-5.1 [41], a proprietary model with restricted API access that forms the core of the semantic grouping mechanism. While the paper specifies SAM 3 [4] for segmentation and the reconstruction backbones (Depth Anything 3 [22] or VGGT [45]), critical implementation details, including the voxel size ($5\,\mathrm{cm}$), the IoU threshold ($\tau_{iou}=0.01$), and the MLLM prompts, are scattered across the main text and the supplementary material. The paper mentions a project page but does not explicitly commit to a code release in the main text, leaving uncertainty about hyperparameter configurations and exact API versioning.
Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
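The merge-time constraint described in the abstract can be sketched as follows: two fragments are linked only if their categories fall in the same compatibility group and a geometric test also passes. This is an illustrative sketch, not the authors' implementation. The pairwise union-find clustering, the function names, and the `geometric_ok` callback are all hypothetical, and each category is assumed to belong to at most one group.

```python
from itertools import combinations


def build_group_lookup(groups):
    """Map each category name to the index of its compatibility group."""
    lookup = {}
    for gid, names in enumerate(groups):
        for name in names:
            lookup[name] = gid
    return lookup


def semantically_compatible(cat_a, cat_b, lookup):
    """Identical labels always match; otherwise both must share a group."""
    if cat_a == cat_b:
        return True
    ga, gb = lookup.get(cat_a), lookup.get(cat_b)
    return ga is not None and ga == gb


def merge_fragments(fragments, groups, geometric_ok):
    """Cluster fragments given as (category, geometry) pairs.

    A pair is linked only when BOTH the semantic gate and the geometric
    test pass. Returns a sorted list of clusters of fragment indices.
    """
    lookup = build_group_lookup(groups)
    parent = list(range(len(fragments)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(fragments)), 2):
        (cat_i, geo_i), (cat_j, geo_j) = fragments[i], fragments[j]
        if semantically_compatible(cat_i, cat_j, lookup) and geometric_ok(geo_i, geo_j):
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(fragments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy example: geometry is a set of voxel indices, and any shared voxel
# counts as geometric consistency (a stand-in for a real overlap test).
fragments = [
    ("washer", {(0, 0, 0), (1, 0, 0)}),
    ("washing machine", {(1, 0, 0)}),
    ("chair", {(1, 0, 0), (2, 0, 0)}),
]
share_voxel = lambda a, b: bool(a & b)

gated = merge_fragments(fragments, [["washer", "washing machine"], ["chair"]], share_voxel)
ungated = merge_fragments(fragments, [["washer", "washing machine", "chair"]], share_voxel)
print(gated)    # [[0, 1], [2]]
print(ungated)  # [[0, 1, 2]]
```

In the toy example all three fragments touch the same voxel, so placing every category in one group (which reduces the gate to geometry only) fuses the chair into the washer. With the two groups kept separate, the "washer" and "washing machine" views are absorbed into one instance while the chair stays distinct.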