Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
This paper addresses selection bias (position and label bias) in large language models during discrete-choice tasks like multiple-choice questions and pairwise evaluation. The authors propose Permutation-Aware GRPO (PA-GRPO), which extends Group Relative Policy Optimization by treating different permutations of the same question as a single training group rather than independent instances. The method enforces semantic consistency across permutations through two mechanisms: a cross-permutation advantage that computes rewards relative to the group mean, and a consistency-aware reward that penalizes disagreement across permutations. Experiments across seven benchmarks and three models (Llama-3.1-8B, Qwen3-8B, Qwen3-32B) demonstrate that PA-GRPO reduces selection bias while maintaining accuracy.
PA-GRPO represents a solid methodological contribution to alignment via reinforcement learning, offering a principled way to encode permutation invariance into the training objective. The paper identifies a genuine failure mode in standard GRPO—termed "permutation-blindness"—where models can achieve high reward under favorable option orderings while failing on others, without any training signal to penalize this inconsistency. The solution is technically sound: by lifting advantage estimation to the permutation-group level and introducing a consistency reward, the model is incentivized to develop robust, order-invariant reasoning. The experimental validation is thorough, covering both LLM-as-a-Judge tasks (MT-Bench, JudgeBench) and MCQ benchmarks (GPQA, ARC-Challenge), with consistent gains in consistency metrics across all settings.
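The permutation-blindness failure mode can be illustrated with a small numerical sketch. The review does not reproduce the paper's exact formula for $A_{PA}$, so the code below assumes plain mean-centering (no standard-deviation normalization), and the function names and reward values are hypothetical.

```python
import numpy as np

def per_permutation_advantages(rewards):
    """Permutation-blind baseline: each permutation (row) is its own group."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean(axis=1, keepdims=True)

def cross_permutation_advantages(rewards):
    """Pooled baseline: all permutations of one question share a single mean."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# Hypothetical rewards: rows are 2 permutations, columns are 4 sampled responses.
rewards = [[1, 1, 1, 1],   # favorable ordering: always correct
           [0, 0, 0, 0]]   # unfavorable ordering: always wrong

blind = per_permutation_advantages(rewards)   # all zeros: no learning signal
aware = cross_permutation_advantages(rewards) # +0.5 for row 0, -0.5 for row 1
print(blind)
print(aware)
```

Under the per-permutation baseline every advantage is zero, so the inconsistency between the two orderings is invisible to the optimizer. Pooling the baseline across permutations makes the unfavorable ordering carry a negative advantage.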
The problem formulation and theoretical grounding are rigorous. The distinction between permutation-blind standard GRPO and the proposed permutation-aware approach is clearly articulated, and the ablation studies in Table 2 validate that both the Cross-Permutation Advantage ($A_{PA}$) and Consistency-Aware Reward ($r_{con}$) contribute meaningfully to performance. The evaluation protocol is commendably thorough: the authors test on Llama-3.1-8B, Qwen3-8B, and Qwen3-32B, using full permutation expansion ($N!$) at test time rather than random sampling, ensuring unbiased consistency estimates. The results show substantial improvements in Consistency (e.g., MT-Bench consistency rising from 80.6% to 88.0% on Llama-3.1-8B) and in Consistent Accuracy, without catastrophic drops in standard accuracy.
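Full permutation evaluation can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it permutes option order only (labels stay fixed as A to D), the `predict` callables are toy stand-ins for a model, and the definition of consistency used here (share of permutations agreeing with the most common semantic choice) is an assumption.

```python
from collections import Counter
from itertools import permutations

LABELS = "ABCD"

def evaluate_all_permutations(options, predict):
    """Run `predict` on every ordering and return the chosen option texts."""
    chosen = []
    for order in permutations(range(len(options))):
        shown = [options[i] for i in order]
        label = predict(shown)                     # e.g. "B"
        chosen.append(shown[LABELS.index(label)])  # map label back to content
    return chosen

def consistency(chosen):
    """Share of permutations that agree with the most common choice (assumed)."""
    top_count = Counter(chosen).most_common(1)[0][1]
    return top_count / len(chosen)

def always_first(shown):
    """Toy position-biased predictor: always picks label A."""
    return "A"

def semantic(shown):
    """Toy order-invariant predictor: always finds the same option text."""
    return LABELS[shown.index("Paris")]

opts = ["Paris", "Rome", "Oslo", "Bern"]
biased = evaluate_all_permutations(opts, always_first)
robust = evaluate_all_permutations(opts, semantic)
print(len(biased), consistency(biased), consistency(robust))  # 24 0.25 1.0
```

With four options the loop runs 24 times per question, which matches the per-instance evaluation cost noted in the review.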
The paper has several weaknesses. First, the method is strictly limited to discrete-choice settings (MCQ and pairwise comparisons) where permutation groups are well-defined; extending PA-GRPO to open-ended generation remains unresolved, as the authors acknowledge in the Limitations section. Second, the computational cost of training is significantly higher than for standard GRPO: for MCQ tasks, the method requires $P \times N$ samples per instance (with $P=5$ permutations and $N=8$ samples each). The ablation in Figure 3 shows that $P=24$ (full permutation) yields only marginal gains over $P=5$, which suggests the structured subset is a necessary, though potentially suboptimal, approximation. Third, the Consistent Accuracy (CA) metric relies on majority voting across permutations, which can mask systematic errors: a model consistently choosing the wrong answer across all 24 permutations would score perfectly on Consistency and CA. Finally, Appendix H reveals residual sensitivity: the model is most robust under coupled label-order permutations (the training condition) but less so under isolated label-only or order-only perturbations, suggesting it learns training-set-specific biases rather than true semantic invariance.
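To make the sampling cost concrete, the per-instance rollout counts implied by the figures above work out as follows. The comparison to standard GRPO assumes it draws the same $N=8$ samples on a single ordering, which the review does not state explicitly.

$$
\underbrace{1 \times 8 = 8}_{\text{single ordering (assumed GRPO)}} \qquad
\underbrace{5 \times 8 = 40}_{P=5\ \text{(PA-GRPO, MCQ)}} \qquad
\underbrace{24 \times 8 = 192}_{P=24\ \text{(full permutation)}}
$$

Under that assumption, the structured subset costs five times as many rollouts per instance as a single ordering, while full permutation would cost 24 times as many.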
The comparison against baselines is generally fair but mixes methodologies: inference-time methods (PriDe, CalibraEval, UniBias) require no training but add computational overhead at inference, while PIF (supervised fine-tuning) and GRPO (RL) are training-time methods like PA-GRPO. The comparisons with PIF and GRPO are the most relevant, and PA-GRPO consistently outperforms both, particularly on consistency metrics. However, the comparison with inference-time methods should acknowledge that PA-GRPO's gains come at the cost of expensive RL training with group sampling. Table 1 shows PA-GRPO achieves superior Consistent Accuracy (e.g., 75.0% vs 73.0% for GRPO on TinyMMLU for Llama-3.1-8B), though some benchmarks show minor accuracy trade-offs (e.g., on RewardBench with Llama-3.1-8B, accuracy drops from 76.3% under GRPO to 75.8% under PA-GRPO). The evidence supports the claim that PA-GRPO reduces selection bias, but the magnitude of improvement over strong baselines like PIF is sometimes modest (e.g., 1-3 percentage points on CA).
The paper provides substantial implementation detail that aids reproduction. Training uses the open-source verl RL framework with LoRA ($r=32, \alpha=64$), and hyperparameters are specified in Appendix F: AdamW optimizer with learning rate $10^{-5}$, KL coefficient $\beta=0.001$, and batch sizes of 40 (MCQ) or 32 (Judge). The authors describe their data filtering protocol—retaining only instances where the base model shows inconsistent predictions across permutations—which is critical for reproducing the training set but introduces a dependency on the specific base model's initial behavior. The paper states code will be made available on GitHub, but as of submission, it is not yet accessible. Full permutation evaluation at test time (using all $N!$ permutations) is clearly defined and should be reproducible, though computationally expensive (24 evaluations per MCQ instance).
Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU-Text-Computing/PA-GRPO).
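One way to picture a consistency-aware reward is sketched below. Neither the abstract nor the review gives the actual form of $r_{con}$, so this agreement-share formulation, the function names, and the `weight` parameter are all hypothetical illustrations rather than the authors' definition.

```python
from collections import Counter

def consistency_rewards(semantic_answers):
    """Give each response the share of group responses that chose the same option.

    `semantic_answers` holds option contents (not positional labels), one per
    sampled response across all permutations of a single question.
    """
    counts = Counter(semantic_answers)
    total = len(semantic_answers)
    return [counts[a] / total for a in semantic_answers]

def combined_rewards(semantic_answers, gold, weight=0.5):
    """Correctness reward plus a weighted agreement bonus (weight is hypothetical)."""
    agree = consistency_rewards(semantic_answers)
    return [float(a == gold) + weight * c for a, c in zip(semantic_answers, agree)]

answers = ["Paris", "Paris", "Rome", "Paris"]
agreement = consistency_rewards(answers)       # [0.75, 0.75, 0.25, 0.75]
total = combined_rewards(answers, gold="Paris")
print(agreement)
print(total)                                   # [1.375, 1.375, 0.125, 1.375]
```

Comparing option contents rather than labels is what lets agreement be measured across differently ordered prompts; a response that disagrees with the rest of its permutation group receives a smaller bonus.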