Persona Vectors in Games: Measuring and Steering Strategies via Activation Vectors

cs.AI cs.GT Johnathan Sun, Andrew Zhang · Mar 22, 2026

AI Review

Plain-language introduction

The paper tackles the challenge of controlling high-level behavioral traits in LLM agents deployed in strategic settings. Rather than treating models as black boxes via prompting, the authors construct 'persona vectors'—linear directions in activation space—for traits like altruism and forgiveness using contrastive activation addition. Applied to six canonical games, these vectors allow both measurement of behavioral tendencies and causal steering of decisions, offering a mechanistic handle on strategic behavior.

Critical review

Verdict

Bottom line

The paper presents a solid extension of activation steering to game-theoretic domains, demonstrating that persona vectors can shift both quantitative strategic choices (e.g., dollars shared in Dictator games) and qualitative reasoning. The finding that rhetoric and strategy can diverge under steering is a valuable contribution to alignment research, as is the evidence that self-behavior and expectations of others occupy partially distinct subspaces. However, the study is limited to a single model (Qwen 2.5-7B) and relies heavily on GPT-4.1-mini as both trait rater and strategy extractor, introducing potential circularity.

What holds up

The core claim that activation steering causally affects strategic behavior holds up: positive steering coefficients (β>0) systematically increase generosity across Dictator, Ultimatum, and Apology games, with giving rising from $15 at baseline to $55 when β=3 (Section 4.2, Fig. 4). The qualitative analysis of reasoning patterns is compelling—high-altruism steering produces empathetic framing while negative steering emphasizes payoff maximization (Section 4.3). The partial separability of self-behavior vectors from expectations vectors (Fig. 6) is well-supported and suggests models maintain distinct representations for 'I am altruistic' versus 'others are altruistic'.

“in the Dictator Game, the model donates $15 on average at baseline (β=0) but up to $55—more than half its endowment—when β=3”

paper · Section 4.2

“model expectations vary more—and in the intended direction—when steering the expectations vectors than when steering the original altruism and forgiveness vectors”

paper · Section 5.2
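
The steering and measurement operations described above can be illustrated with a short sketch. It assumes the common additive form, h' = h + βv, applied to one layer's hidden state, and uses a scalar projection onto the vector as the measurement. The function names and toy numbers are hypothetical, and the paper's exact normalization and token positions are not stated in this review.

```python
from math import sqrt
from typing import List


def steer(hidden: List[float], persona: List[float], beta: float) -> List[float]:
    """Return hidden + beta * persona (assumed additive steering form)."""
    if len(hidden) != len(persona):
        raise ValueError("hidden state and persona vector must have equal length")
    return [h + beta * v for h, v in zip(hidden, persona)]


def projection(hidden: List[float], persona: List[float]) -> float:
    """Scalar projection of hidden onto the persona direction (the 'measurement')."""
    norm = sqrt(sum(v * v for v in persona))
    if norm == 0.0:
        raise ValueError("persona vector must be non-zero")
    return sum(h * v for h, v in zip(hidden, persona)) / norm


# Toy 2-d values for illustration only, not real model activations.
HIDDEN = [1.0, 2.0]
PERSONA = [0.5, -1.0]

if __name__ == "__main__":
    for beta in (-3.0, 0.0, 3.0):
        steered = steer(HIDDEN, PERSONA, beta)
        print(beta, steered, round(projection(steered, PERSONA), 3))
```

In this toy form the projection moves by exactly β‖v‖ in either direction, so the addition itself is symmetric in β. The asymmetry between positive and negative steering reported in Section 4.2 would therefore have to come from how later layers respond to the shifted state.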

Main concerns

The most significant concern is the rhetoric-strategy divergence: in the forgiveness experiments, higher β produces more 'forgiving' rhetoric but less forgiving strategic choices in the Trust and Costly Punishment games (Section 5.1). This gap between stated reasoning and actual behavior poses a serious challenge for alignment evaluation, because an evaluation that rates only the model's justifications would misread what it actually does. Additionally, the asymmetry between positive and negative steering (positive steering works reliably; negative steering is 'weaker and more variable') is noted but not explained in depth. The authors acknowledge that altruism and selfishness may not lie on the same linear axis, yet the paper proceeds as if single-direction vectors suffice (Section 4.2). Finally, all experiments are one-shot anonymous games, whereas real strategic behavior emerges through repeated interaction and reputation formation.

“ratings and strategy can move in opposite directions... the model's strategic choices become less forgiving as β increases—the opposite of what we would expect”

paper · Section 5.1

“Altruism-suppressing steering (β<0) produces weaker and more variable effects... consistent with Chen et al. (2025), who observe that steering toward a trait is generally more effective than steering away from it”

paper · Section 4.2

Evidence and comparison

The evidence supports the main claims about steering efficacy, though with caveats. The comparison to Chen et al. (2025) is fair and properly credited—the authors note their work is 'methodologically closest' and confirm the asymmetry finding. However, relying on GPT-4.1-mini as a judge for both constructing vectors and evaluating steered outputs creates a potential circularity: the judge may share biases with the steered model or rate surface rhetoric over strategic substance (Section 3.4). The paper acknowledges this limitation but doesn't quantify its impact. The contrast with prior game-theoretic LLM studies (Akata et al., Fontana et al.) appropriately positions this work as offering mechanistic insight rather than just behavioral description.

“Our reliance on GPT-4.1-mini as both trait rater and game judge introduces potential circularity and shared biases; human evaluation or alternative automated judges would strengthen our conclusions”

paper · Section 6
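
One way the circularity could be quantified is to score the same transcripts with a second judge (human or automated) and report rank agreement with GPT-4.1-mini. The sketch below computes Spearman's rank correlation in plain Python. The score lists are placeholders rather than data from the paper, and the choice of Spearman is an assumption made here, not something the authors propose.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("score lists must have equal length")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder trait scores (0-100) from two hypothetical judges.
JUDGE_A = [80, 65, 30, 90, 10]
JUDGE_B = [70, 72, 25, 95, 20]

if __name__ == "__main__":
    print(round(spearman(JUDGE_A, JUDGE_B), 3))
```

Low agreement on trait ratings alongside high agreement on extracted strategies would suggest the judge is reacting to surface rhetoric, which is the failure mode the review raises.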

Reproducibility

Reproducibility is moderately strong. The authors provide code, prompts, and additional figures at github.com/johnathansun/persona-vector-agents (Section 6.1). Hyperparameters are clearly specified: steering at layer 20 with β∈[-5,5], using Qwen 2.5-7B. However, exact reproduction requires access to GPT-4.1-mini for the rating and extraction pipeline, which introduces API-dependent variability. The paper lacks ablations on layer choice (results are reported only for layer 20, although all 28 layers were tested) and doesn't specify the random seed or sampling temperature used for generation. The filtering threshold (positive responses ≥50, negative <50) is defined, but the sensitivity of the results to this cutoff isn't explored.

“Our code, prompts, and additional figures are available at github.com/johnathansun/persona-vector-agents”

paper · Section 6.1

“we focus on layer 20, which produced stable and interpretable effects”

paper · Section 3.2
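
The filtering and vector-construction steps referred to above can be sketched as follows. This assumes a difference-of-means construction in which responses elicited by trait-positive prompts are kept only if the judge scores them at or above the cutoff, and responses from trait-negative prompts only if they score below it. The record layout, function names, and toy numbers are hypothetical and are not taken from the authors' repository.

```python
from typing import List, Sequence, Tuple

# (prompt label "pos"/"neg", judge score 0-100, activation at the chosen layer)
Sample = Tuple[str, float, List[float]]


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(samples: Sequence[Sample], cutoff: float = 50.0) -> List[float]:
    """Mean activation of kept positives minus mean activation of kept negatives."""
    pos = [act for label, score, act in samples if label == "pos" and score >= cutoff]
    neg = [act for label, score, act in samples if label == "neg" and score < cutoff]
    if not pos or not neg:
        raise ValueError("cutoff leaves one group empty")
    return [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]


# Toy 2-d activations for illustration only.
SAMPLES: List[Sample] = [
    ("pos", 90.0, [2.0, 0.0]),
    ("pos", 70.0, [4.0, 2.0]),
    ("pos", 40.0, [9.0, 9.0]),  # dropped at cutoff 50: prompt failed to elicit the trait
    ("neg", 30.0, [0.0, 0.0]),
    ("neg", 10.0, [0.0, 2.0]),
    ("neg", 60.0, [9.0, 9.0]),  # dropped at cutoff 50: trait expressed despite the prompt
]

if __name__ == "__main__":
    # Sensitivity check: how much does the vector move as the cutoff changes?
    for cutoff in (40.0, 50.0, 60.0):
        print(cutoff, persona_vector(SAMPLES, cutoff))
```

Repeating the construction over a small grid of cutoffs, as in the loop above, is the kind of sensitivity analysis the paper does not report.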

Abstract

Large language models (LLMs) are increasingly deployed as autonomous decision-makers in strategic settings, yet we have limited tools for understanding their high-level behavioral traits. We use activation steering methods in game-theoretic settings, constructing persona vectors for altruism, forgiveness, and expectations of others by contrastive activation addition. Evaluating on canonical games, we find that activation steering systematically shifts both quantitative strategic choices and natural-language justifications. However, we also observe that rhetoric and strategy can diverge under steering. In addition, vectors for self-behavior and expectations of others are partially distinct. Our results suggest that persona vectors offer a promising mechanistic handle on high-level traits in strategic environments.
