Getting to the Point: Why Pointing Improves LVLMs

cs.CV Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi · Mar 23, 2026
Plain-language introduction

Pointing-based methods improve Large Vision-Language Models (LVLMs) by grounding objects before answering, yet the underlying mechanism remains unclear. This work investigates why pointing helps by comparing Direct Counting against Point-then-Count (PtC) in zero-shot counting tasks using synthetic data with controlled spatial layouts. The authors find that intermediate coordinate supervision encourages skill learning rather than narrow task memorization, yielding stronger out-of-distribution generalization while providing verifiable visual explanations.

Critical review
Verdict
Bottom line

The paper presents a compelling mechanistic analysis of pointing in LVLMs, demonstrating that spatial supervision via coordinates drives generalization to higher object counts and distractor-rich scenes. Rigorous ablations show that replacing coordinates with "X" tokens catastrophically degrades OOD performance (a drop of more than 90 percentage points for some models), which strongly supports the claim that PtC fosters skill acquisition. However, the reliance on synthetic grid-based images with uniform black backgrounds limits ecological validity, and the spatial biases observed in LLaVA-OneVision and Qwen2.5-VL raise concerns about the robustness of coordinate-based explanations in natural settings.

“removing spatial information consistently reduces accuracy, indicating that spatial information is the key to improving LVLMs' generalization to higher object counts”
paper · Section 4.5
“validating these findings on natural images with clutter and occlusion remains essential”
paper · Section 5
What holds up

The experimental design leveraging the CIVET framework to generate balanced, contamination-free synthetic datasets enables precise control over spatial configuration and object counts, facilitating rigorous OOD evaluation. The ablation studies convincingly isolate spatial encoding as the causal mechanism: models fine-tuned to output "X" instead of coordinates preserve in-distribution performance (>97%) yet catastrophically fail OOD (Qwen2.5-VL 7B drops from 94.78% to 3.36%), while ablating the image entirely when ground-truth coordinates are provided causes minimal performance degradation (<2%). These results demonstrate that models rely primarily on textual coordinates rather than visual features to compute counts.

“removing the image has little to no effect (<2% drop) on performance, indicating that models generate the final count solely based on the coordinates, disregarding almost completely the visual modality”
paper · Table 3
“coordinates are grounded in the image in more than 89% of cases (as measured by F1 score), supporting their use as potential visual explanations”
paper · Section 4.4
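
The "X" ablation described above can be illustrated with a minimal sketch. The target format used here (`(x, y)` pairs followed by `Count: n`) and the helper names `ptc_target` and `x_ablated_target` are assumptions made for illustration; the paper's actual prompt and output templates may differ.

```python
import re

# Hypothetical output format (an assumption, not the paper's exact template):
# "(x1, y1), (x2, y2), ... Count: n"
POINT_RE = re.compile(r"\(\s*\d+\s*,\s*\d+\s*\)")


def ptc_target(points):
    """Build a Point-then-Count style target: coordinates first, then the count."""
    coords = ", ".join(f"({x}, {y})" for x, y in points)
    return f"{coords} Count: {len(points)}"


def x_ablated_target(target):
    """Replace each coordinate pair with 'X', keeping one token per object and the count."""
    return POINT_RE.sub("X", target)


points = [(37, 112), (410, 261), (634, 559)]
full = ptc_target(points)
ablated = x_ablated_target(full)

print(full)     # (37, 112), (410, 261), (634, 559) Count: 3
print(ablated)  # X, X, X Count: 3
```

The ablated target still enumerates one placeholder per object, so any difference in OOD behavior between the two formats can be attributed to the spatial content of the coordinates rather than to the act of enumerating.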
Main concerns

The study's heavy reliance on synthetic data with simplified visual characteristics (uniform black backgrounds, no occlusions, grid-based layouts) severely limits claims about real-world applicability. Despite grounding F1 scores above 89%, significant spatial biases persist: LLaVA-OneVision exhibits a pronounced left-to-right performance degradation, while Qwen2.5-VL models struggle at the bottom and right edges. These biases undermine the reliability of coordinates as universal visual explanations. Furthermore, model-generated counts frequently disagree with the number of predicted coordinates in OOD settings (consistency drops to 20-48% for some models), which suggests that the interface between grounding and aggregation remains fragile and that coordinates alone do not guarantee correct reasoning.

“LLaVA-OneVision exhibits the most pronounced spatial biases, with performance progressively decreasing from left to right”
paper · Section 4.4
“Cons. 20.56”
paper · Table 2
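
To make the grounding and spatial-bias measurements concrete, here is a minimal sketch of how grounding F1 and a per-column recall breakdown could be computed on a 9x9 grid. The matching rule (a predicted point counts as correct if it falls in a ground-truth object's cell), the omission of cell padding, and the function names are assumptions for illustration, not the paper's evaluation code.

```python
from collections import defaultdict

GRID = 9          # 9x9 grid, as reported in the review
IMAGE_SIZE = 672  # 672x672 pixels, as reported in the review


def to_cell(point):
    """Map an (x, y) pixel point to a (row, col) cell (assumed rule, padding ignored)."""
    x, y = point
    col = min(int(x * GRID / IMAGE_SIZE), GRID - 1)
    row = min(int(y * GRID / IMAGE_SIZE), GRID - 1)
    return row, col


def grounding_f1(pred_points, gold_cells):
    """F1 between the set of predicted cells and the set of ground-truth cells."""
    pred_cells = {to_cell(p) for p in pred_points}
    gold = set(gold_cells)
    tp = len(pred_cells & gold)
    precision = tp / len(pred_cells) if pred_cells else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_by_column(samples):
    """samples: iterable of (pred_points, gold_cells); returns {column: recall}."""
    hit, total = defaultdict(int), defaultdict(int)
    for pred_points, gold_cells in samples:
        pred_cells = {to_cell(p) for p in pred_points}
        for cell in gold_cells:
            total[cell[1]] += 1
            hit[cell[1]] += cell in pred_cells
    return {col: hit[col] / total[col] for col in sorted(total)}


gold = [(0, 0), (4, 4), (8, 8)]
pred = [(30, 30), (330, 340), (100, 600)]  # the last point lands in the wrong column
f1 = grounding_f1(pred, gold)
cols = recall_by_column([(pred, gold)])
print(round(f1, 3))  # 0.667
print(cols)          # {0: 1.0, 4: 1.0, 8: 0.0}
```

A recall that falls as the column index grows would correspond to the left-to-right degradation reported for LLaVA-OneVision; the same breakdown over rows would expose bottom-edge failures.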
Evidence and comparison

The evidence robustly supports the core claims regarding OOD generalization, with PtC substantially outperforming Direct Counting when extrapolating to counts beyond the training range (10-18 objects at test time vs. 1-9 during training). The comparison across four LVLMs from three model families (Qwen2.5-VL, LLaVA-OneVision, InternVL3.5) strengthens generalizability, though the absence of larger proprietary models leaves open questions about scale effects. The mechanistic analysis through coordinate ablation and image ablation is methodologically sound: it tests directly whether spatial information drives the performance gains, rather than showing only that the two are correlated. The finding that computing counts directly from the number of predicted coordinates yields higher accuracy than the model's own final answer (94.78% vs. 46.54% for Qwen2.5-VL 7B) reveals a critical failure mode in how models aggregate grounded information.

“using the predicted coordinates to count increases accuracy substantially, with 7B and 8B models exceeding 90%”
paper · Table 1
“computing the count from the number of predicted points yields the highest accuracy across models”
paper · Section 4.2
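
The gap between the model's stated answer and the count implied by its predicted points can be sketched as follows. The output format, the regular expressions, and the definition of consistency as agreement between the stated count and the number of predicted points are assumptions for illustration; the example outputs are made up, not taken from the paper.

```python
import re

# Hypothetical output format: "(x1, y1), (x2, y2), ... Count: n"
POINT_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")
COUNT_RE = re.compile(r"Count:\s*(\d+)")


def parse_output(text):
    """Extract the predicted points and the stated count (None if missing)."""
    points = [(int(x), int(y)) for x, y in POINT_RE.findall(text)]
    match = COUNT_RE.search(text)
    stated = int(match.group(1)) if match else None
    return points, stated


def evaluate(outputs, gold_counts):
    """Accuracy of the stated count, accuracy of len(points), and their agreement."""
    n = len(outputs)
    stated_ok = points_ok = consistent = 0
    for text, gold in zip(outputs, gold_counts):
        points, stated = parse_output(text)
        stated_ok += stated == gold
        points_ok += len(points) == gold
        consistent += stated == len(points)
    return {
        "stated_acc": stated_ok / n,
        "points_acc": points_ok / n,
        "consistency": consistent / n,
    }


# Made-up model outputs for illustration only.
outputs = [
    "(10, 20), (30, 40) Count: 2",
    "(10, 20), (30, 40), (50, 60) Count: 2",
    "(5, 5) Count: 1",
    "(1, 1), (2, 2) Count: 3",
]
gold_counts = [2, 3, 2, 2]
metrics = evaluate(outputs, gold_counts)
print(metrics)  # {'stated_acc': 0.25, 'points_acc': 0.75, 'consistency': 0.5}
```

In this toy data the points are right more often than the stated answer, which mirrors the direction of the reported effect: grounding succeeds while the final aggregation step fails.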
Reproducibility

The authors demonstrate strong reproducibility practices by releasing code, data, and model checkpoints. Experimental details are comprehensive: LoRA configuration (rank $r=32$, scaling $\alpha=64$), optimizer settings (AdamW, learning rate $1\times 10^{-5}$), early stopping criteria (patience 2), and exact dataset construction protocols ($9\times 9$ grids, $672\times 672$ pixel resolution, 3-pixel padding) are documented in the appendix. The use of greedy decoding with explicit regex-based answer extraction and coordinate parsing further supports replication. However, the computational requirements (an NVIDIA A100 with 80 GiB of memory, up to 2 days per experiment) and the reliance on specific synthetic data generation pipelines may pose barriers for researchers with limited resources.

“To support reproducibility, we release code, data, and model checkpoints on GitHub”
paper · Abstract
“rank r=32, scaling alpha=64 ... learning rate of 1e-5 ... NVIDIA A100 (80 GiB) GPU”
paper · Appendix B
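
For readers who want the reported settings in one place, the hyperparameters listed above can be collected in a small configuration object. This is a sketch: the class name `FineTuneConfig` and its field names are hypothetical, and settings the review does not mention (target modules, batch size, number of epochs) are deliberately left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the values reported in the review."""
    lora_rank: int = 32               # rank r
    lora_alpha: int = 64              # scaling alpha
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5
    early_stopping_patience: int = 2
    grid_size: int = 9                # 9x9 grid
    image_size: int = 672             # 672x672 pixels
    padding_px: int = 3

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank


cfg = FineTuneConfig()
print(cfg.lora_scaling)  # 2.0
```

With $r=32$ and $\alpha=64$, the standard LoRA scaling factor $\alpha / r$ equals 2.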
Abstract

Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
