ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints
Direct Preference Optimization (DPO) for Vision-Language Models suffers from Likelihood Displacement, where optimization collapses the probabilities of both chosen and rejected responses, causing models to abandon visual evidence for language priors. This paper proposes Asymmetric Constrained Preference Optimization (ACPO), which applies dynamic, length-aware scaling exclusively to the rejected reward term, preserving the chosen distribution as a stable anchor while selectively suppressing incorrect outputs.
The paper presents a technically rigorous solution to Likelihood Displacement in multimodal alignment. The derivation of the dynamic scaling coefficient $\alpha^* = (r(y_w) - \tau_{\text{batch}})/r(y_l)$ is elegant, and the gradient analysis formally demonstrates how asymmetric gradient suppression halts the collapse of chosen rewards. Empirical results on InternVL3 models show consistent gains across hallucination benchmarks (POPE, HallusionBench) without degrading general capabilities (MMBench, OCRBench). However, the work is limited by its reliance on a proprietary 320K-pair dataset, and its claim of modality-agnosticism remains unvalidated beyond vision-language tasks.
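To make the asymmetry concrete, here is a minimal worked derivation. This review does not reproduce the paper's objective, so the sketch assumes a DPO-like loss $\mathcal{L} = -\log\sigma(m)$ with $m = r(y_w) - \alpha\, r(y_l)$, and assumes $\alpha$ is treated as a constant during backpropagation (a stop-gradient). Both are assumptions of this sketch, not statements from the paper.

$$
\frac{\partial \mathcal{L}}{\partial r(y_w)} = -\sigma(-m), \qquad \frac{\partial \mathcal{L}}{\partial r(y_l)} = \alpha\,\sigma(-m)
$$

Under these assumptions the rejected-term gradient has $\alpha$ times the magnitude of the chosen-term gradient, and $\alpha = 1$ recovers the symmetric DPO case. Substituting $\alpha^*$ gives $m = r(y_w) - \alpha^* r(y_l) = \tau_{\text{batch}}$, so the coefficient is exactly the value that places the scaled margin at the target.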
The theoretical framework is sound: the Length-Adaptive Advantage Target $\tau(y_w, y_l) \triangleq \delta \cdot (|y_w| + |y_l|)$ addresses length bias without a hand-picked fixed margin, and the ablation studies show that removing asymmetric control (forcing $\alpha=1$) lowers the POPE score from 89.22 to 86.89. The training dynamics visualizations (Figures 2-3) provide compelling empirical evidence that ACPO decouples chosen reward stability from margin optimization, with chosen rewards stabilizing near +10 while DPO collapses to +2.
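The two formulas quoted in this review can be sketched in a few lines of Python. The function names are hypothetical, the default $\delta = 0.1$ and the clamp range $[0.3, 0.95]$ are the values the review reports, and the fallback for $r(y_l) \ge 0$ is an assumption of this sketch because that regime is not characterized.

```python
def length_adaptive_target(len_chosen: int, len_rejected: int, delta: float = 0.1) -> float:
    """tau(y_w, y_l) = delta * (|y_w| + |y_l|), with lengths measured in tokens."""
    return delta * (len_chosen + len_rejected)


def scaling_coefficient(r_chosen: float, r_rejected: float, tau: float,
                        lo: float = 0.3, hi: float = 0.95) -> float:
    """Clamped version of alpha* = (r(y_w) - tau) / r(y_l)."""
    if r_rejected >= 0:
        # Assumption of this sketch: outside the r(y_l) < 0 regime the formula
        # is not described in the review, so fall back to the upper clamp.
        return hi
    alpha = (r_chosen - tau) / r_rejected
    return min(max(alpha, lo), hi)


# Toy numbers, chosen only for illustration.
tau = length_adaptive_target(30, 20)           # delta * 50 tokens
alpha = scaling_coefficient(2.0, -6.0, tau)    # (2 - 5) / (-6)
scaled_margin = 2.0 - alpha * (-6.0)           # equals tau when alpha is not clamped
print(tau, alpha, scaled_margin)
```

When the unclamped coefficient falls inside the clamp range, the scaled margin $r(y_w) - \alpha\, r(y_l)$ lands on the target; once clamping is active it no longer does, which is why the unablated clamp range matters.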
Four critical limitations stand out. First, the paper claims ACPO is "modality-agnostic" yet provides no validation on text-only tasks or other modalities, so the generalizability claim is unsupported. Second, the core hyperparameter $\delta=0.1$ is set empirically without sensitivity analysis or theoretical justification, and the clamping range $[0.3, 0.95]$ used for $\hat{\alpha}$ in the final optimization deviates from the theoretical $[0,1]$ bound without justification or ablation. Third, the 320K preference dataset is proprietary and internally curated, making independent reproduction impossible and raising questions about data contamination. Fourth, the analysis assumes $r(y_l) < 0$ as the "steady-state condition," but the behavior when this assumption breaks (for example, early in training) is not characterized.
The evidence supports the primary claim that ACPO mitigates Likelihood Displacement: ACPO achieves POPE scores of 89.22 (14B) and 89.32 (8B), outperforming DPO (86.89, 86.85) and SimPO (87.81, 83.26). The bootstrap significance test ($p<0.01$) adds statistical rigor. However, the comparison conflates algorithmic improvements with dataset effects since the proprietary preference data (constructed via GPT-4o and rule-based sampling) differs from public splits used by baselines in prior work. The cross-attention analysis (Figure 5) showing +13.9% peak advantage in visual attention retention provides mechanistic evidence, though the sample size for this analysis is not specified.
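For readers unfamiliar with bootstrap significance testing on benchmark scores, the following is a minimal sketch of one common paired variant. The review does not say which variant the paper uses, so the function name, the resample count, and the synthetic per-example correctness data below are all assumptions for illustration.

```python
import random


def paired_bootstrap_p_value(correct_a, correct_b, n_resamples=2000, seed=0):
    """One-sided paired bootstrap: how often does system A fail to beat B?

    correct_a / correct_b hold 0/1 correctness for the same examples in the
    same order. Returns a smoothed estimate of P(accuracy_A <= accuracy_B)
    under resampling of the examples with replacement.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing.
        resampled_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_sum <= 0:
            not_better += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (not_better + 1) / (n_resamples + 1)


# Synthetic data, not the paper's results: A is right on 90/100, B on 70/100.
system_a = [1] * 90 + [0] * 10
system_b = [1] * 70 + [0] * 30
p_value = paired_bootstrap_p_value(system_a, system_b)
print(p_value)
```

A test like this quantifies sampling noise on a fixed evaluation set only; it cannot separate algorithmic gains from the dataset effects raised above.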
Reproducibility is significantly hampered: the 320K-pair preference dataset is proprietary and not released, with construction relying on GPT-4o access and internal rule-based correctness filters that lack implementation details. Code release is not mentioned. While training hyperparameters are specified ($\beta=0.1$, lr=$1\times10^{-6}$, batch size 32, 1 epoch), the exact data curation pipelines, particularly the "Visual Grounding Contrast" strategy using degraded visual conditions, are described only at a high level. The experiments use InternVL3-Instruct models with frozen vision encoders, which is standard, but without data access, independent verification of the claimed hallucination improvements is impossible.
While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.