When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

cs.CV cs.AI Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang · Mar 22, 2026
AI Review
Plain-language introduction

Current multimodal large language models rely on expensive annotated data or teacher distillation to improve their reasoning. This paper proposes an unsupervised self-evolution framework that trains without ground-truth labels or external reward models by instantiating two roles: an Actor that generates multiple reasoning trajectories, and a frozen Judge whose scores modulate consistency-based rewards. The method applies group-wise distributional modeling with Group Relative Policy Optimization (GRPO) to convert absolute scores into relative advantages, achieving absolute accuracy gains of up to 5.9 points on MathVision while maintaining healthier training entropy than majority-voting baselines.
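
The reward pipeline summarized above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the sample answers, the Judge multipliers, the multiplicative combination of consistency and Judge signal, and the mean/std normalization (the commonly used GRPO form) are all assumptions, and the paper's exact baseline may differ.

```python
from collections import Counter
from statistics import mean, pstdev


def consistency_rewards(answers):
    """Score each trajectory by the share of the group that reached the same final answer."""
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]


def group_relative_advantages(scores, eps=1e-6):
    """Convert absolute scores into within-group advantages (z-score form; an assumption)."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / (sd + eps) for s in scores]


# Hypothetical group of n = 8 sampled final answers for one unlabeled question.
answers = ["12", "12", "12", "7", "12", "9", "12", "7"]
# Hypothetical bounded Judge multipliers near 1, one per trajectory.
judge_multipliers = [1.2, 1.0, 1.1, 0.8, 1.0, 0.7, 1.2, 0.9]

# Assumption: the Judge reweights the consistency prior multiplicatively.
scores = [r * g for r, g in zip(consistency_rewards(answers), judge_multipliers)]
advantages = group_relative_advantages(scores)
```

Because the advantages are centered within each group, a trajectory is rewarded only for scoring better than its siblings, not for its absolute score.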

Critical review
Verdict
Bottom line

The paper presents a compelling unsupervised reinforcement learning framework for multimodal reasoning that effectively mitigates mode collapse through theoretically grounded distributional modeling. The Actor-Judge design with bounded modulation $g(s)$ provides a principled solution to pseudo-consistency in self-evolution. However, the frozen Judge imposes a hard capability ceiling that limits sustained improvement, and the documented drop in pass@10 metrics suggests the method trades exploration diversity for task accuracy.

“this work primarily investigates the construction of stable training signals, and leaves the question of how to further improve the self-evolving system beyond the Judge's capability limit to future study”
paper · Limitations section
What holds up

The group-wise distributional modeling is the strongest theoretical contribution. Appendix A demonstrates that the log-sum-exp baseline induces a target distribution $q_\alpha(\tau_k|x) = \frac{\exp(\alpha R_k)}{\sum_j \exp(\alpha R_j)}$, where the policy update minimizes $D_{KL}(q_\alpha(\cdot|x) \| \pi_\theta(\cdot|x))$. This prevents deterministic collapse when multiple candidates have comparable rewards, unlike majority voting which converges to a one-hot target. The bounded Judge modulation $g(s) = 1 + \lambda_+ \sigma(\frac{s-t_h}{\tau_h}) - \lambda_- \sigma(\frac{t_l-s}{\tau_l})$ is carefully designed to provide continuous quality calibration without over-relying on raw Judge scores, addressing the pseudo-consistency problem where high consistency does not imply high quality.

“the optimal policy approaches the reward-induced distribution $q_{\alpha}(\cdot\mid x)$”
paper · Appendix A, Eq. A.8
“$g(s)=1+\lambda_{+}\,\sigma\!\Big(\frac{s-t_{h}}{\tau_{h}}\Big)-\lambda_{-}\,\sigma\!\Big(\frac{t_{l}-s}{\tau_{l}}\Big)$”
paper · Section 3.2, Eq. 7
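
A small numerical sketch may help make the two formulas concrete. The thresholds $t_h=0.95$ and $t_l=0.40$ are the values the review reports from Appendix E; the values of $\lambda_+$, $\lambda_-$, $\tau_h$, $\tau_l$, and $\alpha$ below, the function names, and the assumption that the Judge score $s$ lies in $[0, 1]$ are illustrative choices, not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def judge_modulation(s, t_h=0.95, t_l=0.40,
                     lam_pos=0.5, lam_neg=0.5, tau_h=0.05, tau_l=0.05):
    """g(s) = 1 + lam_pos * sigma((s - t_h) / tau_h) - lam_neg * sigma((t_l - s) / tau_l).

    lam_pos, lam_neg, tau_h, tau_l are hypothetical values chosen for illustration.
    The result always stays inside (1 - lam_neg, 1 + lam_pos).
    """
    bonus = lam_pos * sigmoid((s - t_h) / tau_h)
    penalty = lam_neg * sigmoid((t_l - s) / tau_l)
    return 1.0 + bonus - penalty


def target_distribution(rewards, alpha=1.0):
    """q_alpha(tau_k | x) = exp(alpha * R_k) / sum_j exp(alpha * R_j), computed stably."""
    shift = max(alpha * r for r in rewards)
    weights = [math.exp(alpha * r - shift) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


# Two trajectories with equal reward share the probability mass equally,
# instead of one of them receiving a one-hot target.
q = target_distribution([1.0, 1.0, 0.2], alpha=2.0)
```

With these illustrative settings, a Judge score well below $t_l$ pulls the multiplier toward $1 - \lambda_-$, a score near $t_h$ pushes it toward $1 + \lambda_+$, and scores in between leave the consistency reward almost unchanged.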
Main concerns

The frozen Judge design creates a fundamental bottleneck. Since the Judge is initialized from the Actor and kept fixed throughout training, it cannot improve its evaluation standards as the Actor evolves, creating a "capability limit" acknowledged in the Limitations section. Empirically, Table 4 shows pass@10 drops from 0.66 to 0.64 on MathVision despite accuracy improvements, indicating reduced output diversity and exploration. The method also exhibits vulnerability to "incorrect consensus" scenarios described in Appendix F, where both the Actor's self-consistency distribution and the Judge favor the same incorrect answer, causing the model to "update in an undesired direction" and reinforcing errors.

“Qwen2.5-VL-7B: 0.66 ... Ours: 0.64”
paper · Table 4
“under this 'incorrect consensus' scenario, the group-relative reward modeling will continue to favor the wrong trajectory”
paper · Appendix F
“the Judge should be able to progressively raise its evaluation standards as training proceeds”
paper · Section 5 (Limitations)
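
For readers unfamiliar with the pass@10 metric cited above, the sketch below shows the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The review does not say which estimator Table 4 uses, so treating it as this one is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n (c of them correct) is correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A lower pass@k at fixed single-sample accuracy means correct answers are concentrated on fewer questions, which is why the drop is read as a loss of output diversity.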
Evidence and comparison

The evidence supports the core claims against relevant baselines. Table 1 demonstrates consistent improvements over MM-UPT (majority voting) and performance comparable to supervised GRPO without using labels. Ablations in Table 2 show that combining Self-Consistency with Judge Scoring (SC + JS) yields a +4.9 average improvement, versus +1.8 for majority voting alone. However, the evaluation is largely limited to mathematical reasoning benchmarks; while Table 7 shows generalization to ChartQA and MMVP, broader multimodal capabilities remain untested. The comparison to supervised methods in Table 1 is fair, though the claim of "sustained self-improvement" is undercut by the frozen Judge, which is never updated during training.

“On MathVision, our method achieves an absolute improvement of up to 5.9 points (30.9 vs. 25.0)”
paper · Table 1
“+ SC + JS (Dist.): 27.55 (+4.9)”
paper · Table 2
Reproducibility

The paper provides strong reproducibility support with code available at the project website. Appendix E details all hyperparameters including learning rate $1\times 10^{-6}$, KL coefficient $\beta=0.01$, group size $n=8$, and Judge calibration thresholds $t_h=0.95, t_l=0.40$. The Judge evaluation prompt (Appendix D) is fully specified with strict JSON output format constraints and scoring rubrics. However, the computational requirement of 8×A800 GPUs and the 1.4× training time overhead relative to supervised GRPO (Table 6) may limit accessibility. Additionally, the sensitivity of the frozen Judge to its initialization and prompt engineering is not fully ablated.

“Optimizer Learning Rate: $1\times 10^{-6}$ ... KL Loss Coefficient ($\beta$): 0.01 ... Rollout Group Size ($n$): 8”
paper · Appendix E, Table 8
“Ours: 1.4$\times$ relative time”
paper · Table 6
“Judge Prompt ... You are an expert evaluator for multimodal mathematical reasoning”
paper · Appendix D
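
The hyperparameters quoted above could be collected in a single configuration object for a reproduction attempt. The values are those the review cites from Appendix E; the class name and field names are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfJudgeConfig:  # hypothetical name, not from the authors' repository
    learning_rate: float = 1e-6   # optimizer learning rate
    kl_coef: float = 0.01         # KL loss coefficient (beta)
    group_size: int = 8           # rollouts sampled per input (n)
    judge_t_high: float = 0.95    # upper Judge calibration threshold (t_h)
    judge_t_low: float = 0.40     # lower Judge calibration threshold (t_l)


cfg = SelfJudgeConfig()
```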
Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://dingwu1021.github.io/SelfJudge/.
