Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

cs.CL cs.AI cs.LG Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He · Mar 22, 2026

Why it matters

Experiments across seven benchmarks and three models (Llama-3.1-8B, Qwen3-8B, Qwen3-32B) demonstrate that PA-GRPO reduces selection bias while maintaining accuracy.

AI Review

Plain-language introduction

This paper addresses selection bias (position and label bias) in large language models during discrete-choice tasks such as multiple-choice questions and pairwise evaluation. The authors propose Permutation-Aware GRPO (PA-GRPO), which extends Group Relative Policy Optimization by treating different permutations of the same question as a single training group rather than as independent instances. The method enforces semantic consistency across permutations through two mechanisms: a cross-permutation advantage, which computes each sample's advantage relative to the mean reward over all permutations of the same instance, and a consistency-aware reward, which penalizes disagreement across permutations. Experiments across seven benchmarks and three models (Llama-3.1-8B, Qwen3-8B, Qwen3-32B) demonstrate that PA-GRPO reduces selection bias while maintaining accuracy.

Critical review

Verdict

Bottom line

PA-GRPO represents a solid methodological contribution to alignment via reinforcement learning, offering a principled way to encode permutation invariance into the training objective. The paper identifies a genuine failure mode in standard GRPO, termed "permutation-blindness": a model can achieve high reward under favorable option orderings while failing on others, without any training signal penalizing the inconsistency. The solution is technically sound: by lifting advantage estimation to the permutation-group level and introducing a consistency reward, the method gives the model an incentive to reason in an order-invariant way. The experimental validation is thorough, covering both LLM-as-a-Judge tasks (e.g., MT-Bench, JudgeBench) and MCQ benchmarks (e.g., GPQA, ARC-Challenge), with gains on consistency metrics in every setting.

“selection bias reflects a failure of robust reasoning in discrete-choice prompting: when only non-semantic factors (labels or positions) change, the model should preserve the same semantic decision”

Zheng et al. (this paper) · Introduction
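
To make "permutation-blindness" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation) and changes only what counts as a group. The function names `per_permutation_advantage` and `cross_permutation_advantage`, the `eps` constant, and the toy rewards are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def per_permutation_advantage(groups, eps=1e-6):
    """Baseline computed separately inside each ordering's own samples.

    groups is a list of P lists, each holding the rewards of the N responses
    sampled for one ordering of the same question.
    """
    out = []
    for rewards in groups:
        mu, sigma = mean(rewards), pstdev(rewards)
        out.append([(r - mu) / (sigma + eps) for r in rewards])
    return out

def cross_permutation_advantage(groups, eps=1e-6):
    """Assumed permutation-aware variant: one baseline shared by all P*N samples."""
    flat = [r for rewards in groups for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in rewards] for rewards in groups]

# Toy case: always correct under ordering 0, always wrong under ordering 1.
groups = [[1.0, 1.0, 1.0, 1.0],
          [0.0, 0.0, 0.0, 0.0]]
blind = per_permutation_advantage(groups)
aware = cross_permutation_advantage(groups)
```

In this toy case the per-ordering baseline gives every sample an advantage of zero, so the failing ordering produces no learning signal at all. The shared baseline gives the failing ordering a negative advantage and the successful one a positive advantage.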

What holds up

The problem formulation and theoretical grounding are rigorous. The distinction between permutation-blind standard GRPO and the proposed permutation-aware approach is clearly articulated, and the ablation studies in Table 2 validate that both the Cross-Permutation Advantage ($A_{PA}$) and Consistency-Aware Reward ($r_{con}$) contribute meaningfully to performance. The evaluation protocol is commendably thorough: they test on Llama-3.1-8B, Qwen3-8B, and Qwen3-32B, using full permutation expansion ($N!$) at test time rather than random sampling, ensuring unbiased consistency estimates. The results show substantial improvements in Consistency (e.g., MT-Bench consistency rising from 80.6% to 88.0% on Llama-3.1-8B) and Consistent Accuracy without catastrophic drops in standard accuracy.

“Crucially, the two components are complementary: $r_{con}$ shapes the model toward agreement, while $A_{PA}$ stabilizes the group-level optimization signal”

Zheng et al. (this paper) · Table 2 (Ablation)
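
The full-permutation evaluation protocol can be sketched as follows. This is a hedged illustration: the review does not give the paper's exact metric definitions, so the three quantities below (`agreement`, `fully_consistent`, `all_correct`) are assumed stand-ins, and `predict` is a hypothetical callable that returns the content of the chosen option so that choices are comparable across orderings.

```python
from collections import Counter
from itertools import permutations

def all_orderings(options):
    """Every ordering of the option contents: len(options)! of them."""
    return list(permutations(options))

def score_instance(predict, options, correct):
    """Evaluate one question under every ordering of its options."""
    picks = [predict(ordering) for ordering in all_orderings(options)]
    _, votes = Counter(picks).most_common(1)[0]
    return {
        "agreement": votes / len(picks),          # share matching the most common pick
        "fully_consistent": votes == len(picks),  # same pick under every ordering
        "all_correct": all(p == correct for p in picks),
    }

def always_first(ordering):
    """Toy position-biased model: picks whatever is displayed first."""
    return ordering[0]

def always_b(ordering):
    """Toy model that always picks the same (wrong) content."""
    return "B"

options = ("A", "B", "C", "D")
biased = score_instance(always_first, options, correct="A")
stubborn = score_instance(always_b, options, correct="A")
```

The position-biased toy model agrees with its own most common pick on only a quarter of the 24 orderings, while the second toy model is perfectly consistent yet never correct, which is why consistency figures are best read together with accuracy.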

Main concerns

First, the method is strictly limited to discrete-choice settings (MCQ and pairwise comparisons) where permutation groups are well-defined; extending PA-GRPO to open-ended generation remains unresolved, as the authors acknowledge in the Limitations section. Second, training cost is significantly higher than for standard GRPO: for MCQ tasks, the method requires $P \times N$ samples per instance (with $P=5$ permutations and $N=8$ samples each, i.e., 40 samples). The ablation in Figure 3 shows that $P=24$ (full permutation) yields only marginal gains over $P=5$, so the structured subset appears to be an adequate, though possibly still suboptimal, approximation that cost makes necessary. Third, the Consistent Accuracy (CA) metric relies on majority voting across permutations, which can mask errors on minority orderings, and Consistency on its own says nothing about correctness: a model that chooses the same wrong answer under all 24 permutations would score perfectly on Consistency. Finally, Appendix H reveals residual sensitivity: the model is most robust under coupled label-order permutations (the training condition) but less so under isolated label-only or order-only perturbations, suggesting that it learns biases specific to the training condition rather than true semantic invariance.

“residual sensitivity to display order is often more pronounced than sensitivity to label symbols, particularly on LLM-as-a-Judge benchmarks”

Zheng et al. (this paper) · Appendix H

“For MCQ ... The full permutation space ($4!=24$) is costly in computation. We employ a structured subset strategy consisting of four cyclic shifts and one reverse order”

Zheng et al. (this paper) · Section 3.2
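
The structured subset quoted above can be sketched in a few lines. Whether the authors construct the shifts and the reverse order exactly this way is an assumption; the helper name `structured_subset` is hypothetical, and the cost figures simply apply the $P \times N$ count with $N=8$ from the review.

```python
def structured_subset(options):
    """Cyclic shifts of the original order plus its reverse.

    For four options this yields 4 + 1 = 5 orderings, matching P = 5.
    """
    n = len(options)
    shifts = [tuple(options[i:] + options[:i]) for i in range(n)]
    return shifts + [tuple(reversed(options))]

options = ["A", "B", "C", "D"]
subset = structured_subset(options)

samples_per_instance = len(subset) * 8  # P * N with N = 8 samples per ordering
full_space_samples = 24 * 8             # all 4! orderings at the same N
```

The four cyclic shifts place every option in every position exactly once, which may be part of why five orderings recover most of the benefit of all 24 at roughly a fifth of the sampling cost (40 versus 192 samples per instance).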

Evidence and comparison

The comparison against baselines is generally fair but mixes methodologies: inference-time methods (PriDe, CalibraEval, UniBias) require no training but add computational overhead at inference, while PIF (supervised fine-tuning) and GRPO (RL) are training-time methods like PA-GRPO. The comparisons with PIF and GRPO are therefore the most relevant, and PA-GRPO consistently outperforms both, particularly on consistency metrics. However, the comparison with inference-time methods should acknowledge that PA-GRPO's gains come at the cost of expensive RL training with group sampling. Table 1 shows that PA-GRPO achieves superior Consistent Accuracy (e.g., 75.0% vs 73.0% for GRPO on TinyMMLU with Llama-3.1-8B), though some benchmarks show minor accuracy trade-offs (e.g., on RewardBench with Llama-3.1-8B, accuracy drops from 76.3% with GRPO to 75.8% with PA-GRPO). The evidence supports the claim that PA-GRPO reduces selection bias, but the improvement over strong training-time baselines such as GRPO is sometimes modest (e.g., 1-3 percentage points absolute on CA).

“PA-GRPO: 75.0 (CA on TinyMMLU) vs GRPO: 73.0 vs PIF: 57.0”

Zheng et al. (this paper) · Table 1

“We compare PA-GRPO with five strong baselines covering both inference-time debiasing and training-time alignment”

Zheng et al. (this paper) · Section 4.3

Reproducibility

The paper provides substantial implementation detail that aids reproduction. Training uses the open-source verl RL framework with LoRA ($r=32, \alpha=64$), and hyperparameters are specified in Appendix F: AdamW optimizer with learning rate $10^{-5}$, KL coefficient $\beta=0.001$, and batch sizes of 40 (MCQ) or 32 (Judge). The authors describe their data filtering protocol—retaining only instances where the base model shows inconsistent predictions across permutations—which is critical for reproducing the training set but introduces a dependency on the specific base model's initial behavior. The paper states code will be made available on GitHub, but as of submission, it is not yet accessible. Full permutation evaluation at test time (using all $N!$ permutations) is clearly defined and should be reproducible, though computationally expensive (24 evaluations per MCQ instance).

“The models were optimized using AdamW with a learning rate of 1e-5 for 2 epochs. We set the KL regularization coefficient $\beta=0.001$”

Zheng et al. (this paper) · Appendix F

“We specifically retained instances where the model yielded inconsistent predictions across permutations, as these samples provide the strongest signal for learning permutation invariance”

Zheng et al. (this paper) · Appendix D
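
The filtering rule quoted from Appendix D can be sketched as below. The data layout (an `id` field per instance and a mapping from id to the base model's pick under each ordering) and the function names are hypothetical, chosen only to illustrate the rule.

```python
def is_inconsistent(picks):
    """True when the base model's semantic choice differs across orderings."""
    return len(set(picks)) > 1

def filter_training_set(instances, base_model_picks):
    """Keep only instances on which the base model disagrees with itself.

    base_model_picks maps an instance id to the option contents the base
    model chose under each ordering of that instance.
    """
    return [inst for inst in instances
            if is_inconsistent(base_model_picks[inst["id"]])]

instances = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]
base_model_picks = {
    "q1": ["A", "A", "A", "A", "A"],  # stable across orderings: dropped
    "q2": ["A", "B", "A", "D", "A"],  # order-sensitive: retained
    "q3": ["C", "C", "C", "C", "C"],  # stable (right or wrong): dropped
}
kept = filter_training_set(instances, base_model_picks)
```

Because the filter is defined by the base model's own predictions, running it with a different base model would select a different training set, which is the reproducibility dependency noted in this section.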

Abstract

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
