Persona Vectors in Games: Measuring and Steering Strategies via Activation Vectors
The paper tackles the challenge of controlling high-level behavioral traits in LLM agents deployed in strategic settings. Rather than treating models as black boxes via prompting, the authors construct 'persona vectors'—linear directions in activation space—for traits like altruism and forgiveness using contrastive activation addition. Applied to six canonical games, these vectors allow both measurement of behavioral tendencies and causal steering of decisions, offering a mechanistic handle on strategic behavior.
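To make the mechanism concrete, here is a minimal NumPy sketch of the general idea behind contrastive activation addition. It assumes the persona vector is a simple difference of mean activations between trait-positive and trait-negative responses, and it uses synthetic activations rather than real model states. The function names (`persona_vector`, `steer`, `trait_score`) and the toy hidden size are illustrative assumptions, not the authors' code.

```python
import numpy as np

def persona_vector(pos_acts, neg_acts):
    """Difference of mean activations: trait-positive minus trait-negative.

    Assumption: a plain difference of means; the paper's exact recipe may differ.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, beta):
    """Activation addition: shift every hidden state by beta * vector."""
    return hidden + beta * vector

def trait_score(hidden, vector):
    """Measurement: project hidden states onto the unit persona direction."""
    unit = vector / np.linalg.norm(vector)
    return hidden @ unit

# Synthetic stand-in for layer activations (hypothetical data, toy hidden size).
rng = np.random.default_rng(0)
d = 16
direction = np.zeros(d)
direction[0] = 1.0
pos = rng.normal(size=(200, d)) + 2.0 * direction  # "altruistic" responses
neg = rng.normal(size=(200, d)) - 2.0 * direction  # "selfish" responses

v = persona_vector(pos, neg)
h = rng.normal(size=(5, d))        # hidden states before steering
steered = steer(h, v, beta=3.0)    # positive beta pushes toward the trait
shift = trait_score(steered, v) - trait_score(h, v)
```

Under these assumptions the same vector serves both purposes the paper describes: projecting onto it measures the trait, and adding a multiple of it steers the trait. In this toy setup the projection rises by exactly β times the vector's norm.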
The paper presents a solid extension of activation steering to game-theoretic domains, demonstrating that persona vectors can shift both quantitative strategic choices (e.g., dollars shared in Dictator games) and qualitative reasoning. The finding that rhetoric and strategy can diverge under steering is a valuable contribution to alignment research, as is the evidence that self-behavior and expectations of others occupy partially distinct subspaces. However, the study is limited to a single model (Qwen 2.5-7B) and relies heavily on GPT-4.1-mini as both trait rater and strategy extractor, introducing potential circularity.
The core claim that activation steering causally affects strategic behavior holds up: positive steering coefficients (β>0) systematically increase generosity across Dictator, Ultimatum, and Apology games, with giving rising from $15 at baseline to $55 when β=3 (Section 4.2, Fig. 4). The qualitative analysis of reasoning patterns is compelling—high-altruism steering produces empathetic framing while negative steering emphasizes payoff maximization (Section 4.3). The partial separability of self-behavior vectors from expectations vectors (Fig. 6) is well-supported and suggests models maintain distinct representations for 'I am altruistic' versus 'others are altruistic'.
The most significant concern is the rhetoric-strategy divergence: in the forgiveness experiments, higher β produces more 'forgiving' rhetoric but less forgiving strategic choices in the Trust and Costly Punishment games (Section 5.1). This gap between stated reasoning and actual behavior poses a serious challenge for alignment evaluation. Additionally, the asymmetry between positive and negative steering (positive steering works reliably; negative steering is 'weaker and more variable') is noted but not explained in depth. The authors acknowledge that altruism and selfishness may not lie on the same linear axis, yet the paper proceeds as if single-direction vectors suffice (Section 4.2). Finally, all experiments are one-shot anonymous games, whereas real strategic behavior emerges through repeated interaction and reputation formation.
The evidence supports the main claims about steering efficacy, though with caveats. The comparison to Chen et al. (2025) is fair and properly credited—the authors note their work is 'methodologically closest' and confirm the asymmetry finding. However, relying on GPT-4.1-mini as a judge for both constructing vectors and evaluating steered outputs creates a potential circularity: the judge may share biases with the steered model or rate surface rhetoric over strategic substance (Section 3.4). The paper acknowledges this limitation but doesn't quantify its impact. The contrast with prior game-theoretic LLM studies (Akata et al., Fontana et al.) appropriately positions this work as offering mechanistic insight rather than just behavioral description.
Reproducibility is moderately strong. The authors provide code, prompts, and additional figures at github.com/johnathansun/persona-vector-agents (Section 6.1). Hyperparameters are clearly specified: steering at layer 20 with β∈[-5,5], using Qwen 2.5-7B. However, exact reproduction requires access to GPT-4.1-mini for the rating and extraction pipeline, which introduces API-dependent variability. The paper lacks ablations on layer choice (only layer 20 is reported despite testing all 28) and doesn't specify the random seed or sampling temperature used for generation. The filtering threshold (positive responses ≥50, negative <50) is defined but the sensitivity to this cutoff isn't explored.
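The missing cutoff-sensitivity analysis could take a form like the sketch below. It assumes the filter keeps positive-prompt responses whose judge score is at least the cutoff and negative-prompt responses whose score is below it, rebuilds a difference-of-means vector at several cutoffs, and compares each to the cutoff-50 vector by cosine similarity. All data are synthetic, and the helper names (`filter_masks`, `vector_for_cutoff`, `cosine`) are hypothetical rather than taken from the paper's repository.

```python
import numpy as np

def filter_masks(is_positive_prompt, scores, cutoff=50):
    """Keep positive-prompt responses scoring >= cutoff and
    negative-prompt responses scoring < cutoff (assumed filter rule)."""
    keep_pos = is_positive_prompt & (scores >= cutoff)
    keep_neg = ~is_positive_prompt & (scores < cutoff)
    return keep_pos, keep_neg

def vector_for_cutoff(acts, is_positive_prompt, scores, cutoff):
    """Difference-of-means vector built only from responses that pass the filter."""
    keep_pos, keep_neg = filter_masks(is_positive_prompt, scores, cutoff)
    return acts[keep_pos].mean(axis=0) - acts[keep_neg].mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic judge scores and activations (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d = 500, 8
is_pos = rng.random(n) < 0.5
scores = np.clip(rng.normal(np.where(is_pos, 70, 30), 20), 0, 100)
acts = rng.normal(size=(n, d))
acts[:, 0] += np.where(is_pos, 1.0, -1.0)  # trait signal on one coordinate

baseline = vector_for_cutoff(acts, is_pos, scores, 50)
sensitivity = {
    c: cosine(baseline, vector_for_cutoff(acts, is_pos, scores, c))
    for c in (30, 40, 50, 60, 70)
}
```

Cosine similarities near 1 across cutoffs would suggest the vector is robust to the threshold; a sharp drop would indicate that the reported results depend on the specific choice of 50.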
Large language models (LLMs) are increasingly deployed as autonomous decision-makers in strategic settings, yet we have limited tools for understanding their high-level behavioral traits. We use activation steering methods in game-theoretic settings, constructing persona vectors for altruism, forgiveness, and expectations of others by contrastive activation addition. Evaluating on canonical games, we find that activation steering systematically shifts both quantitative strategic choices and natural-language justifications. However, we also observe that rhetoric and strategy can diverge under steering. In addition, vectors for self-behavior and expectations of others are partially distinct. Our results suggest that persona vectors offer a promising mechanistic handle on high-level traits in strategic environments.