ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints

cs.CV Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu · Mar 23, 2026

Plain-language introduction

Direct Preference Optimization (DPO) for Vision-Language Models suffers from Likelihood Displacement, where optimization collapses the probabilities of both chosen and rejected responses, causing models to abandon visual evidence for language priors. This paper proposes Asymmetric Constrained Preference Optimization (ACPO), which applies dynamic, length-aware scaling exclusively to the rejected reward term, preserving the chosen distribution as a stable anchor while selectively suppressing incorrect outputs.

Critical review

Verdict

Bottom line

The paper presents a technically rigorous solution to Likelihood Displacement in multimodal alignment. The derivation of the dynamic scaling coefficient $\alpha^* = (r(y_w) - \tau_{\text{batch}})/r(y_l)$ is elegant, and the gradient analysis formally demonstrates how asymmetric gradient suppression halts the collapse of chosen rewards. Empirical results on InternVL3 models show consistent gains across hallucination benchmarks (POPE, HallusionBench) without degrading general capabilities (MMBench, OCRBench). However, the work is limited by its reliance on a proprietary 320K-pair dataset, and its claim of modality-agnosticism remains unvalidated beyond vision-language tasks.

“$\alpha^{*}=\frac{r(y_{w})-\tau_{\text{batch}}}{r(y_{l})}$”

ACPO paper, Section 4.3.2 · Equation 11

“ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference.”

ACPO paper, Section 1 · Abstract
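
To make the quoted Equation 11 concrete, the following is a minimal sketch of how the scaling coefficient and a per-pair loss might be computed. The review quotes only the coefficient, the target $\tau$, and the clipping window, so everything else here is an assumption made for illustration: the function names, the batch-mean aggregation behind $\tau_{\text{batch}}$, the loss form $-\log\sigma(r(y_w) - \hat{\alpha}\, r(y_l))$, and the reward values. It is not the authors' implementation.

```python
import math

def length_adaptive_target(len_w, len_l, delta=0.1):
    """tau(y_w, y_l) = delta * (|y_w| + |y_l|), as quoted in the review."""
    return delta * (len_w + len_l)

def alpha_star(r_w, r_l, tau_batch):
    """Quoted Equation 11: alpha* = (r(y_w) - tau_batch) / r(y_l).
    Only defined for nonzero r_l (the paper assumes r_l < 0)."""
    return (r_w - tau_batch) / r_l

def clamp(x, lo=0.3, hi=0.95):
    """Empirical clipping window quoted from Section 5.1."""
    return max(lo, min(hi, x))

def pair_loss(r_w, r_l, alpha_hat):
    """ASSUMED loss form: -log(sigmoid(r_w - alpha_hat * r_l)).
    With alpha_hat = 1 this is the standard DPO loss on the reward margin."""
    margin = r_w - alpha_hat * r_l
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical batch: (r_w, r_l, |y_w|, |y_l|) for each preference pair.
batch = [(1.0, -5.0, 12, 18), (0.5, -4.0, 20, 30)]

# ASSUMPTION: tau_batch is the batch mean of the per-pair targets.
tau_batch = sum(length_adaptive_target(lw, ll) for _, _, lw, ll in batch) / len(batch)

alphas = [clamp(alpha_star(r_w, r_l, tau_batch)) for r_w, r_l, _, _ in batch]
losses = [pair_loss(r_w, r_l, a) for (r_w, r_l, _, _), a in zip(batch, alphas)]
print(tau_batch, alphas, losses)
```

Under the assumed loss form, passing `alpha_hat=1.0` to `pair_loss` gives the symmetric DPO margin $r(y_w) - r(y_l)$, which matches the ablation described in the review.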

What holds up

The theoretical framework is sound: the Length-Adaptive Advantage Target $\tau(y_w, y_l) \triangleq \delta \cdot (|y_w| + |y_l|)$ addresses length bias without arbitrary margin guessing, and the ablation studies conclusively demonstrate that removing asymmetric control (forcing $\alpha=1$) causes immediate performance degradation on POPE (89.22 to 86.89). The training dynamics visualizations (Figures 2-3) provide compelling empirical evidence that ACPO decouples chosen reward stability from margin optimization, with chosen rewards stabilizing near +10 while DPO collapses to +2.

“w/o Asymmetric Control ($\alpha=1$): When we force a symmetric update by disabling $\hat{\alpha}$, the objective formally degenerates to standard symmetric DPO... resulting in a significant drop in the POPE score (from 89.22 to 86.89).”

ACPO paper, Section 5.4 · Table 2

“ACPO breaks this symmetric coupling: the chosen reward $r(y_{w})$ remains stable as an anchor, while the rejected reward $r(y_{l})$ absorbs most of the optimization pressure.”

ACPO paper, Section 5.3 · Figure 2 caption
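
A short worked step shows why the coefficient is target-oriented. It follows from the quoted Equation 11 under one assumption: that the scaled margin has the form $r(y_w) - \alpha\, r(y_l)$, which is consistent with the $\alpha=1$ ablation reducing to standard DPO but is not stated explicitly in the excerpts above.

```latex
\begin{aligned}
r(y_w) - \alpha^{*}\, r(y_l)
  &= r(y_w) - \frac{r(y_w) - \tau_{\text{batch}}}{r(y_l)}\, r(y_l)
   = \tau_{\text{batch}}, \qquad r(y_l) \neq 0, \\[4pt]
0 \le \alpha^{*} \le 1 \ \text{ and } \ r(y_l) < 0
  &\iff r(y_l) \le r(y_w) - \tau_{\text{batch}} \le 0
   \iff r(y_w) \le \tau_{\text{batch}} \le r(y_w) - r(y_l).
\end{aligned}
```

In words, the unclipped coefficient rescales the rejected reward so that the margin lands exactly on the target, and it stays in $[0,1]$ only while the chosen reward is below the target and the unscaled margin already meets it.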

Main concerns

Four critical limitations stand out. First, the paper claims ACPO is "modality-agnostic" yet provides no validation on text-only tasks or other modalities, which leaves the generalizability claim unsupported. Second, the core hyperparameter $\delta=0.1$ is set empirically without sensitivity analysis or theoretical justification, and the clamping range $[0.3, 0.95]$ used for $\hat{\alpha}$ in the final optimization departs from the theoretical $[0,1]$ bound without an ablation to justify it. Third, the 320K preference dataset is proprietary and internally curated, which makes independent reproduction impossible and raises questions about data contamination. Fourth, the analysis assumes $r(y_l) < 0$ as the "steady-state condition," but the behavior when this assumption does not hold (for example, early in training) is not characterized.

“the length-scaling factor $\delta$ used to define the Length-Adaptive Advantage Target $\tau_{\text{batch}}$ is empirically set to $0.1$ across all experiments.”

ACPO paper, Section 5.1 · Implementation Details

“we apply a slightly tighter empirical clipping window of $[0.3,0.95]$ during the final optimization”

ACPO paper, Section 5.1 · Implementation Details
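
The fourth concern can be illustrated numerically. The sketch below evaluates the quoted coefficient $\alpha^* = (r(y_w) - \tau_{\text{batch}})/r(y_l)$ and the quoted $[0.3, 0.95]$ window for a few hypothetical reward values, including cases where $r(y_l) \ge 0$. The reward values, the target, and the helper names are assumptions for illustration only; how the paper's implementation treats this regime is not described in the excerpts reviewed here.

```python
def raw_alpha(r_w, r_l, tau_batch):
    """Quoted Equation 11; undefined at r_l == 0."""
    return (r_w - tau_batch) / r_l

def clipped_alpha(r_w, r_l, tau_batch, lo=0.3, hi=0.95):
    """Apply the quoted [0.3, 0.95] window to the raw coefficient."""
    return max(lo, min(hi, raw_alpha(r_w, r_l, tau_batch)))

tau_batch = 4.0  # hypothetical target value

# Hypothetical (r_w, r_l) pairs covering both signs of the rejected reward.
cases = {
    "steady state (r_l < 0)": (1.0, -5.0),
    "r_l slightly negative": (1.0, -1e-6),
    "r_l slightly positive": (1.0, 1e-6),
    "r_l positive": (1.0, 0.5),
}

results = {
    name: (raw_alpha(r_w, r_l, tau_batch), clipped_alpha(r_w, r_l, tau_batch))
    for name, (r_w, r_l) in cases.items()
}

for name, (raw, clipped) in results.items():
    print(f"{name:24s} raw={raw:>16.3f}  clipped={clipped:.2f}")
```

With the chosen reward below the target, the raw coefficient diverges with opposite signs on either side of $r(y_l)=0$, so under this simple clipping the coefficient would jump between the two ends of the window. Whether the authors' training code behaves this way is exactly what the review notes is left uncharacterized.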

Evidence and comparison

The evidence supports the primary claim that ACPO mitigates Likelihood Displacement: ACPO achieves POPE scores of 89.22 (14B) and 89.32 (8B), outperforming DPO (86.89, 86.85) and SimPO (87.81, 83.26). The bootstrap significance test ($p<0.01$) adds statistical rigor. However, the comparison conflates algorithmic improvements with dataset effects since the proprietary preference data (constructed via GPT-4o and rule-based sampling) differs from public splits used by baselines in prior work. The cross-attention analysis (Figure 5) showing +13.9% peak advantage in visual attention retention provides mechanistic evidence, though the sample size for this analysis is not specified.

“on the object hallucination benchmark POPE, standard DPO significantly degrades the base Instruct model's performance from 88.48 to 86.89... ACPO not only prevents this degradation but boosts the score to 89.22”

ACPO paper, Section 5.2 · Table 1

“The gains of ACPO over DPO are formally shown to be statistically significant ($p<0.01$).”

ACPO paper, Section 5.2 · Statistical Testing
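
The excerpts above do not say how the bootstrap test was run. As a rough sketch of one common choice, a paired bootstrap over per-example correctness, the following uses synthetic 0/1 outcomes. The function name, the resample count, and the data are all hypothetical and do not reproduce the paper's numbers or procedure.

```python
import random

def paired_bootstrap_p(correct_a, correct_b, n_resamples=2000, seed=0):
    """Estimate a one-sided p-value: the fraction of bootstrap resamples
    in which system A does NOT score higher than system B."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-example difference in correctness: +1, 0, or -1.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples

# Synthetic outcomes on 1000 examples (NOT the paper's data):
# system A is right on 880 examples, system B on 860, with disagreements both ways.
system_a = [1] * 880 + [0] * 120
system_b = [0] * 40 + [1] * 840 + [0] * 100 + [1] * 20

p_value = paired_bootstrap_p(system_a, system_b)
print(f"estimated one-sided p-value: {p_value:.4f}")
```

Resampling pairs rather than each system independently keeps the per-example correlation between the two systems, which matters when both are evaluated on the same benchmark questions.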

Reproducibility

Reproducibility is significantly hampered: the 320K-pair preference dataset is proprietary and not released, with construction relying on GPT-4o access and internal rule-based correctness filters that lack implementation details. Code release is not mentioned. While training hyperparameters are specified ($\beta=0.1$, lr=$1\times10^{-6}$, batch size 32, 1 epoch), the data curation pipelines, particularly the "Visual Grounding Contrast" strategy using degraded visual conditions, are described only at a high level. The experiments use InternVL3-Instruct models with frozen vision encoders, which is standard, but without data access, independent verification of the claimed hallucination improvements is impossible.

“we curate a proprietary internal dataset comprising approximately 320K preference pairs... chosen responses are generated by GPT-4o with full visual access, while rejected responses are produced by the SFT model under degraded visual conditions”

ACPO paper, Section 5.1 · Datasets

“The current evaluation relies on a proprietary preference dataset.”

ACPO paper, Section 6 · Limitations

Abstract

While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
