Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

cs.CV cs.AI Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun · Mar 23, 2026
AI Review
Plain-language introduction

Hyperbolic Vision-Language Models (VLMs) improve hierarchical structure preservation over Euclidean counterparts, yet existing approaches treat all part-whole relationships as equally informative. This paper proposes UNCHA (UNcertainty-guided Compositional Hyperbolic Alignment), which leverages the hyperbolic radius as an uncertainty measure to quantify the varying semantic representativeness of image parts to the whole scene. By incorporating this uncertainty into adaptive temperature scaling for contrastive learning and an entropy-regularized entailment loss, UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and fine-grained compositional benchmarks, demonstrating that modeling heterogeneous part-whole strength is critical for complex multi-object understanding.

Critical review
Verdict
Bottom line

The paper offers a compelling geometric insight, using hyperbolic radius to encode part-whole representativeness, and backs it with strong empirical results on fine-grained tasks. However, the work is undermined by an abundance of hyperparameters (λ₁, λ₂, λ_ent, α, η_intra, η_inter, K, and multiple temperatures) with no sensitivity analysis, and the specific functional form u(x)=log(1+exp(−∥x∥₂)) is justified only as 'differentiable' and 'well-behaved' rather than on theoretical grounds. Furthermore, the ATMG baseline performs suspiciously poorly (e.g., 34.1 Top-1 on ImageNet with ViT-S, against 43.9 for UNCHA) despite claims of identical training configurations, raising concerns about implementation fidelity.

“We leverage the geodesic distance from the origin (radius) in hyperbolic space to quantify the part-to-whole semantic representativeness using hyperbolic uncertainty.”
paper · Section 3.2.1
“Eq. 7 is a smooth monotonic transformation of the hyperbolic radius, which is a differentiable, well-behaved uncertainty measure for numerical stability.”
paper · Section 3.2.1
“ATMG† 34.1 ... UNCHA (Ours) 43.9”
paper · Table 1
What holds up

The central thesis—that parts vary in semantic representativeness and that hyperbolic radius naturally encodes this uncertainty—is well-motivated by cognitive science and prior work linking radius to abstractness. The uncertainty-guided contrastive loss (Eq. 11) elegantly implements this by scaling temperatures per-part via τ_un,i^I=exp(u(i_i^part)/2)τ_gl, allowing less representative parts to contribute less to the alignment objective. The comprehensive evaluation spanning 16 classification datasets, hierarchical metrics, and challenging part-level alignment with hard negatives (Table 3) provides robust evidence of improved compositional understanding. Ablation studies (Table 4) confirm that removing uncertainty modeling, contrastive adaptation, or entropy regularization each degrades performance, validating the architectural choices.

“Our approach incorporates uncertainty into the global-local contrastive loss by considering the varying semantic representativeness of multiple parts... higher uncertainty leads to a larger temperature and a smaller contribution to the contrastive loss.”
paper · Section 3.2.2
“Removing any component leads to consistent performance drops, showing that all modules contribute meaningfully.”
paper · Section 4.4
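
To make the temperature scaling concrete, the sketch below evaluates the two formulas exactly as quoted in this review: u(x)=log(1+exp(−∥x∥₂)) and τ_un=exp(u/2)·τ_gl. It is an illustration, not the authors' implementation. The function names, the example vectors, and the value τ_gl=0.07 are hypothetical, and ∥x∥₂ is assumed here to be the Euclidean norm of the embedding vector.

```python
import math

def hyperbolic_uncertainty(embedding):
    """u(x) = log(1 + exp(-||x||_2)), as quoted in the review."""
    radius = math.sqrt(sum(v * v for v in embedding))
    # log1p keeps precision when exp(-radius) is tiny (large radius).
    return math.log1p(math.exp(-radius))

def part_temperature(embedding, tau_global):
    """tau_un = exp(u(x) / 2) * tau_gl: higher uncertainty gives a larger temperature."""
    return math.exp(hyperbolic_uncertainty(embedding) / 2.0) * tau_global

# Hypothetical part embeddings: one close to the origin, one far from it.
near_origin = [0.1, 0.0, 0.0]      # norm 0.1
far_from_origin = [4.0, 3.0, 0.0]  # norm 5.0
tau_gl = 0.07                      # hypothetical global temperature

u_near = hyperbolic_uncertainty(near_origin)
u_far = hyperbolic_uncertainty(far_from_origin)
tau_near = part_temperature(near_origin, tau_gl)
tau_far = part_temperature(far_from_origin, tau_gl)
```

Because ∥x∥₂ ≥ 0, this form of u is confined to (0, log 2], so the temperature multiplier exp(u/2) stays between 1 and √2. If the formulas are used as quoted, the per-part temperature can vary by at most about 41% around τ_gl, which is one reason an ablation against alternative functional forms would be informative.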
Main concerns

First, the uncertainty formulation u(x)=log(1+exp(−∥x∥₂)), while numerically stable, lacks theoretical justification beyond monotonicity; it is treated as a proxy for radius without probabilistic grounding or ablation against alternatives. Second, the method introduces a hyperparameter burden (λ₁=0.5, λ₂=10.0, λ_ent=0.2, α=0.1, η_intra=1.2, η_inter=0.7, K=0.1) with no sensitivity analysis, raising questions about robustness and generalization. Third, the ATMG baseline's severe underperformance relative to its original publication suggests potential implementation disparities not fully explained by 'exterior angle' similarity metrics alone. Finally, the uncertainty calibration loss (Eq. 15) combines stop-gradient operations with exponential terms (e^{−u(p)}), which may create optimization instabilities or biased gradient estimates that are not discussed.

“For Eq.17, the weighting coefficients are λ₁=0.5 and λ₂=10.0”
paper · Section S.1.2
“L_ent^cal(p,q)=⌊L_ent^*(p,q)⌋e^{−u(p)}+u(p)+H(ũ(p))”
paper · Section 3.2.3
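
The interaction between the stop-gradient and the exponential term can be examined with a small worked example. The sketch below assumes that ⌊·⌋ in the quoted Eq. 15 denotes stop-gradient, so the entailment loss acts as a constant L when differentiating with respect to u. It omits the entropy term H(ũ(p)), whose definition is not reproduced in this review. All names and the value L=1.5 are hypothetical.

```python
import math

def calibration_terms(entail_loss, u):
    """First two terms of the quoted Eq. 15: L * exp(-u) + u.

    entail_loss is treated as a constant (stop-gradient assumption);
    the entropy term is omitted because its definition is not given here.
    """
    return entail_loss * math.exp(-u) + u

def grad_wrt_u(entail_loss, u):
    """d/du [L * exp(-u) + u] = 1 - L * exp(-u)."""
    return 1.0 - entail_loss * math.exp(-u)

L_fixed = 1.5                # hypothetical detached entailment loss
u_star = math.log(L_fixed)   # stationary point: 1 - L * exp(-u) = 0

grad_at_star = grad_wrt_u(L_fixed, u_star)
loss_at_star = calibration_terms(L_fixed, u_star)
```

Under these assumptions the two terms are convex in u and minimized at u = log L, so the uncertainty is pulled toward the log of the detached entailment loss. If u(p) is the same quantity as in Eq. 7, which is bounded in (0, log 2], that target is reachable only when 1 < L ≤ 2; outside that range the gradient keeps its sign and pushes u toward one end of its range. Whether this occurs in practice depends on the scale of the entailment loss and on the omitted entropy term, neither of which can be checked from the review alone.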
Evidence and comparison

Comparisons to HyCoCLIP and MERU on the GRIT dataset are fair and show consistent gains: UNCHA improves ImageNet Top-1 by 2.2% (ViT-S) and 3.0% (ViT-B) over HyCoCLIP, and achieves superior hierarchical metrics (TIE↓ 3.39 vs. 3.55). The part-level alignment results on Densely Captioned Images (Table 3) are particularly convincing, with UNCHA achieving 56.51% on hard negatives versus 52.89% for CLIP. However, the strong negative correlation between uncertainty and part-whole similarity shown in Figure 4 (r=−0.739) is somewhat circular since the model is explicitly trained to produce this relationship. The multi-object representation gains on ComCo (Table 5) are substantial, though absolute margins over CLIP are sometimes modest.

“TIE(↓) ... UNCHA (Ours) 3.39 ... HyCoCLIP 3.55”
paper · Table 2
“UNCHA (Ours) 56.51 ... CLIP 52.89”
paper · Table 3
“part-to-whole similarity vs. uncertainty shows a strong negative correlation (r=−0.739)”
paper · Figure 4
Reproducibility

The authors provide detailed training protocols in Appendix S.1: 500K iterations with batch size 768, AdamW optimizer (lr=5×10^{−4}), curvature clamped to [0.1,10.0], and initialization schemes (c_img=1/√512). Code is publicly available. However, critical gaps remain: no discussion of computational overhead relative to standard hyperbolic VLMs; no hyperparameter sensitivity analysis; and insufficient detail on how random crops for part-images are generated (sizes, aspect ratios, filtering). The GRIT dataset is public (20.5M pairs), but reproducing exact part-level preprocessing may require additional clarification. The parameterization of only the space component while deriving the time component (Eq. S.23) could affect numerical stability but is not thoroughly analyzed.

“The batch size and total number of training iterations are fixed at 768 and 500,000, respectively”
paper · Section 4.1
“The curvature of Lorentz space is initialized to κ=1.0 and treated as a learnable parameter, while being clamped in [0.1,10.0] for numerical stability.”
paper · Section S.1.2
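
For readers unfamiliar with the space/time parameterization mentioned above, the following sketch shows the standard Lorentz-model construction in which only the space component is stored and the time component is derived from it. Eq. S.23 is not reproduced in this review, so the formula x_time = √(1/κ + ∥x_space∥²) is an assumption based on the usual convention for a hyperboloid of curvature −κ. The function names and example values are hypothetical; the clamp range [0.1, 10.0] is taken from the quoted text.

```python
import math

def clamp_curvature(kappa, low=0.1, high=10.0):
    """Clamp the learnable curvature to the range quoted from Section S.1.2."""
    return min(max(kappa, low), high)

def derive_time_component(space, kappa):
    """Assumed convention: x_time = sqrt(1/kappa + ||x_space||^2)."""
    sq_norm = sum(v * v for v in space)
    return math.sqrt(1.0 / kappa + sq_norm)

def lorentz_inner(space_a, time_a, space_b, time_b):
    """Lorentzian inner product: <a, b>_L = <a_space, b_space> - a_time * b_time."""
    return sum(a * b for a, b in zip(space_a, space_b)) - time_a * time_b

kappa = clamp_curvature(1.0)   # initial value quoted in the paper
x_space = [0.3, -0.2, 0.5]     # hypothetical space component
x_time = derive_time_component(x_space, kappa)

# A point on the hyperboloid satisfies <x, x>_L = -1/kappa.
self_inner = lorentz_inner(x_space, x_time, x_space, x_time)
```

The constraint check subtracts two quantities that become nearly equal as ∥x_space∥ grows, which is where floating-point cancellation can appear. That is the kind of numerical-stability question the review notes is not analyzed in the paper.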
Abstract

While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
