Behavioural feasible set: Value alignment constraints on AI decision support

cs.AI econ.GN q-fin.EC Taejin Park · Mar 22, 2026


Plain-language introduction

Organizations deploying commercial AI systems inherit vendor-imposed value constraints that limit which recommendations the system can produce. This paper formalizes these boundaries as a "behavioural feasible set" and demonstrates through controlled experiments that alignment training compresses this set, making AI systems structurally unable to endorse certain legitimate organizational actions even under strong contextual pressure. The work reframes AI governance from a capability question to a constraint diagnosis problem, showing that vendor selection partially determines which trade-offs remain negotiable for adopting firms.

Critical review

Verdict

Bottom line

The paper presents a compelling conceptual framework connecting technical AI alignment to organizational governance, supported by well-designed experiments comparing pre- and post-alignment model variants. The diagnostic approach using KL divergence bounds to characterize recommendation rigidity is innovative and practical. However, the empirical scope is limited to binary choices and ranking tasks that may oversimplify complex organizational decision-making, and the causal identification relies heavily on the Llama base/instruct comparison while treating commercial models as black boxes where constraints can only be inferred indirectly.

“The governance puzzle is therefore not whether AI can support decisions, but which recommendations the system can actually produce given how its vendor has configured it”

Park, Section 1

“alignment makes the system substantially less able to shift its recommendation even under legitimate contextual pressure”

Park, Abstract

What holds up

The conceptual contribution is strong: framing alignment as a constraint on the feasible set, rather than merely a bias in outputs, is theoretically productive. The experimental design comparing Llama Base and Llama Instruct provides clean causal identification of alignment's effect: reversal rates drop from 66.7% to 8.8% under identical interventions. The mathematical formalization using KL divergence bounds ($\kappa_{rev}(x) \coloneqq d(1/2 \parallel p_0(x))$) translates abstract governance concerns into measurable thresholds. The stakeholder study effectively demonstrates that alignment shifts value priors rather than neutralizing them: after training, the model's implied priority moves from shareholders to customers.

“Llama Base achieves reversal in 67% of eligible intervention conditions... Llama Instruct achieves reversal in only 8.8% of eligible conditions”

Park, Section 5.2

“alignment shifts implied stakeholder priorities rather than neutralising them”

Park, Abstract
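
To make the reversal threshold concrete, here is a minimal sketch of $\kappa_{rev}(x)$. It assumes that $d(\cdot \parallel \cdot)$ denotes the binary (Bernoulli) KL divergence in nats and that $p_0(x)$ is the baseline probability that the model recommends a given option in scenario $x$. Neither assumption is confirmed by the passages quoted here, and the function names are illustrative.

```python
import math

def binary_kl(q: float, p: float) -> float:
    """KL divergence d(q || p) between Bernoulli(q) and Bernoulli(p), in nats."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:  # by convention, 0 * log(0 / b) contributes 0
            total += a * math.log(a / b)
    return total

def kappa_rev(p0: float) -> float:
    """Assumed reading of the review's formula: d(1/2 || p0)."""
    return binary_kl(0.5, p0)

# Hypothetical baseline probabilities, chosen only for illustration.
for p0 in (0.5, 0.7, 0.9, 0.99):
    print(f"p0={p0:.2f}  kappa_rev={kappa_rev(p0):.4f}")
```

Under these assumptions the threshold is zero when the baseline is indifferent ($p_0 = 1/2$) and grows as the baseline recommendation becomes more lopsided, which matches the intuition that a more entrenched default takes a larger KL budget to flip.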

Main concerns

The generalizability from experimental scenarios to real organizational decisions remains uncertain. The binary choice design (Option A vs Option B) artificially polarizes decisions when real managerial trade-offs often involve nuanced intermediate positions. The paper acknowledges but does not resolve the challenge that proprietary models' internal reference policies are unobservable, making the KL budget $\kappa(x)$ a latent construct estimated only through behavioural proxies. The claim that "better prompting cannot resolve" these constraints (Section 1) is strong given the limited prompt engineering attempted: only three intervention types were tested per scenario. Additionally, the stakeholder study relies on Borda-normalized rankings, which assume a linear substitutability between stakeholder interests that may not reflect actual organizational utility functions.

“For commercial models no such counterfactual is available, and the diagnostic bounds should be read as behavioural characterisations under the chosen protocol”

Park, Section 5.1

“The constraint arises from how commercial models are built... organisations can act at Levels 2 and 3, but cannot relax Level 1”

Park, Section 1
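
The concern about Borda-normalized rankings is easier to see with a small sketch. The version below assumes a common normalization, in which the top rank scores 1, the bottom rank scores 0, and ranks in between are equally spaced. The paper's exact scoring rule is not quoted in this review, so treat this as an illustration, with hypothetical stakeholder names.

```python
from collections import defaultdict

def borda_normalized(rankings):
    """Average normalized Borda score per item.

    rankings: list of lists, each ordering the same items from best to worst.
    Assumed scoring: position 0 scores 1.0, the last position scores 0.0,
    and intermediate positions are equally spaced.
    """
    n = len(rankings[0])
    if n < 2:
        raise ValueError("need at least two items to rank")
    totals = defaultdict(float)
    for ranking in rankings:
        if len(ranking) != n or len(set(ranking)) != n:
            raise ValueError("every ranking must order the same n distinct items")
        for position, item in enumerate(ranking):
            totals[item] += (n - 1 - position) / (n - 1)
    return {item: score / len(rankings) for item, score in totals.items()}

# Two hypothetical sampled rankings of three stakeholders.
samples = [
    ["customers", "employees", "shareholders"],
    ["customers", "shareholders", "employees"],
]
print(borda_normalized(samples))
```

Equal spacing is where the linearity assumption enters: moving a stakeholder from second to first place counts exactly as much as moving it from third to second, whatever the organization's real trade-offs are.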

Evidence and comparison

The evidence supports the core claim that alignment reduces behavioural flexibility, with the Llama comparison providing convincing causal evidence. The comparison to related work is thorough, situating the contribution within the agency theory, incomplete contracting, and platform governance literatures. However, the paper could engage more deeply with recent prompt-engineering and "jailbreaking" research, which suggests that local configuration can sometimes circumvent alignment constraints and so challenges the claim that these bounds are structurally immutable. The domain-specific results, which show Physical Safety and Honesty as "zero reversal" zones for commercial models while Third-Party Welfare remains more flexible, provide useful nuance, though the small sample size (20 scenarios) limits statistical power for domain-level comparisons.

“Physical Safety and Honesty show zero reversals for the commercial models across all interventions. Third-Party Welfare is the most flexible domain”

Park, Section 5.2

“The empirics accordingly implement a diagnostic rather than a structural strategy”

Park, Section 5.1
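
The statistical power concern about "zero reversal" domains can be quantified with a rough sketch. If each eligible condition is treated as an independent trial (an assumption the paper's design may not satisfy), then observing zero reversals in $n$ trials gives a one-sided upper confidence bound of $1 - \alpha^{1/n}$ on the true reversal rate, which is the Clopper-Pearson bound for zero events. The trial counts below are hypothetical; the review does not report how many conditions fall in each domain.

```python
def zero_event_upper_bound(n_trials: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) upper bound on a rate after 0 events in n_trials.

    Solves (1 - p) ** n_trials == alpha for p, assuming independent trials.
    """
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 - alpha ** (1.0 / n_trials)

# Hypothetical numbers of eligible conditions per domain.
for n in (4, 12, 20):
    print(f"n={n:2d}  95% upper bound on reversal rate = {zero_event_upper_bound(n):.3f}")
```

With only a handful of conditions per domain, the bound stays far from zero, so "zero reversals" is compatible with a substantial underlying reversal rate.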

Reproducibility

The paper provides substantial reproducibility details including full scenario texts in Appendix B, exact model versions (gpt-5-mini, claude-haiku-4-5-20251001, llama-3.1-8b), and sampling parameters (temperature 1.0, 50 samples per condition). However, no code repository or raw data download links are provided, and the exact API calls with full prompts are only partially shown. The paper notes that "Llama Base sometimes produced non-conforming outputs; sampling continued until 50 valid responses were obtained per condition" without defining what constitutes "non-conforming" or how this selection might bias results. Commercial model versions may update frequently, potentially making the specific behavioural results ephemeral even if the methodological contribution remains valid.

“For each scenario–condition pair, I draw 50 independent valid samples at temperature 1.0”

Park, Section 5.1

“Llama Base (llama-3.1-8b) and Llama Instruct (llama-3.1-8b-instruct)”

Park, Section 5.1
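
One way to address the selection concern about "sampling continued until 50 valid responses were obtained" is to keep the rejected outputs and report a rejection rate alongside the results. The sketch below shows the idea with a stand-in generator. `fake_model`, `is_conforming`, and the validity rule are hypothetical, no real model API is called, and this is not the paper's procedure.

```python
import random

def sample_until_valid(generate, is_valid, n_valid=50, max_attempts=1000):
    """Draw outputs until n_valid pass is_valid, keeping the rejected ones too."""
    valid, rejected = [], []
    while len(valid) < n_valid and len(valid) + len(rejected) < max_attempts:
        output = generate()
        if is_valid(output):
            valid.append(output)
        else:
            rejected.append(output)
    return valid, rejected

# Hypothetical stand-in for a model call, seeded for repeatability.
rng = random.Random(0)

def fake_model():
    return rng.choice(["A", "B", "I cannot decide"])

def is_conforming(output):
    # Assumed validity rule: the reply must be exactly one of the two options.
    return output in {"A", "B"}

valid, rejected = sample_until_valid(fake_model, is_conforming)
rejection_rate = len(rejected) / (len(valid) + len(rejected))
print(f"valid={len(valid)} rejected={len(rejected)} rejection_rate={rejection_rate:.2f}")
```

Reporting the rejection rate per condition would let readers judge whether discarded outputs are rare noise or a systematic filter that favours one option.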

Abstract

When organisations adopt commercial AI systems for decision support, they inherit value judgements embedded by vendors that are neither transparent nor renegotiable. The governance puzzle is not whether AI can support decisions but which recommendations the system can actually produce given how its vendor has configured it. I formalise this as a behavioural feasible set, the range of recommendations reachable under vendor-imposed alignment constraints, and characterise diagnostic thresholds for when organisational requirements exceed the system's flexibility. In scenario-based experiments using binary decision scenarios and multi-stakeholder ranking tasks, I show that alignment materially compresses this set. Comparing pre- and post-alignment variants of an open-weight model isolates the mechanism: alignment makes the system substantially less able to shift its recommendation even under legitimate contextual pressure. Leading commercial models exhibit comparable or greater rigidity. In multi-stakeholder tasks, alignment shifts implied stakeholder priorities rather than neutralising them, meaning organisations adopt embedded value orientations set upstream by the vendor. Organisations thus face a governance problem that better prompting cannot resolve: selecting a vendor partially determines which trade-offs remain negotiable and which stakeholder priorities are structurally embedded.
