ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention
ROM tackles overthinking in Large Reasoning Models, where models keep generating redundant reasoning after reaching a correct answer. The core idea is a lightweight streaming detector, an 8.13M-parameter head attached to late-layer hidden states of a frozen LLM, that predicts overthinking probability token by token and triggers early stopping. It matters because it promises a roughly 47% reduction in response length without retraining the full model. We find the method empirically effective but note concerns regarding data scaling limits and labeling costs.
ROM presents a sound technical contribution in formulating overthinking mitigation as a streaming prediction problem. The approach is well-motivated: instead of expensive RL-based training or brittle entropy heuristics, it learns from correctness-boundary supervision anchored at the First Correct Solution (FCS). The Counterfactual Self-Correction (CSC) augmentation effectively addresses the first-solution bias in distilled training data. However, the training set is small (1,533 samples), and the paper admits that performance does not scale noticeably with more data, a concerning limitation for generalization. The reliance on GPT-4o for labeling also creates a reproducibility barrier.
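The correctness-boundary supervision mentioned above can be pictured with a small sketch. It assumes, as a simplification that the paper may not follow exactly, that every token up to and including the end of the First Correct Solution is labeled 0 (efficient) and every later token is labeled 1 (overthinking). The names `label_tokens` and `fcs_end` are hypothetical.

```python
from typing import List


def label_tokens(num_tokens: int, fcs_end: int) -> List[int]:
    """Assign a binary overthinking label to each token position.

    Assumption (illustrative only): positions 0..fcs_end cover the reasoning
    that reaches the First Correct Solution and get label 0; every position
    after fcs_end gets label 1.
    """
    if not 0 <= fcs_end < num_tokens:
        raise ValueError("fcs_end must index a token inside the trace")
    return [0 if i <= fcs_end else 1 for i in range(num_tokens)]


# A trace of 8 tokens whose first correct solution ends at position 4.
labels = label_tokens(num_tokens=8, fcs_end=4)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Under this reading, a per-token detector is trained to output a low probability before the boundary and a high probability after it.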
The streaming detection paradigm is the strongest contribution. By monitoring late-layer hidden states with a CfC recurrent cell and attention-pooled prefix summaries, ROM captures phase changes in reasoning that entropy-based methods miss. The boundary-aware intervention with backtracing (rewinding to clean sentence boundaries) is crucial—without it, mid-sentence cuts cause the model to generate compensatory explanations, negating savings. The empirical gains are substantial: 93.51% accuracy with 1,159 tokens versus vanilla’s 91.72% with 2,197 tokens, and a 52.7% efficiency margin over the RL-based L1 baseline.
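To make the boundary-aware intervention concrete, here is a minimal sketch of thresholding followed by backtracing. It is not the authors' implementation: the function name `find_cut_index`, the set of sentence-ending markers, and the fallback when no boundary exists are all assumptions, and the per-token probabilities are supplied by hand rather than by a trained detector. Only the 0.5 threshold comes from the paper as summarized here.

```python
from typing import List, Optional

# Assumed sentence-boundary markers; the paper's actual rule may differ.
SENTENCE_END = (".", "!", "?", "\n")


def find_cut_index(
    tokens: List[str], probs: List[float], threshold: float = 0.5
) -> Optional[int]:
    """Return how many leading tokens to keep, or None if nothing fires.

    The detector fires at the first position whose overthinking probability
    exceeds the threshold. The cut is then rewound to just after the most
    recent sentence-ending token, so the kept text ends on a clean boundary.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    for i, p in enumerate(probs):
        if p > threshold:
            j = i - 1
            while j >= 0 and not tokens[j].endswith(SENTENCE_END):
                j -= 1
            # Fallback (assumption): with no earlier boundary, cut at i.
            return j + 1 if j >= 0 else i
    return None


tokens = ["x", "=", "4", ".", " Wait", ",", " let", " me", " check"]
probs = [0.1, 0.1, 0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9]
keep = find_cut_index(tokens, probs)
print("".join(tokens[:keep]))  # x=4.
```

The detector fires at " let", but the kept text ends at the preceding period, which avoids the mid-sentence cut that the review identifies as the cause of compensatory explanations. In real decoding the same check would run once per generated token rather than over a finished list.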
First, the training data is minimal (740 efficient + 793 overthinking samples) and derived from a single labeling pipeline (QwQ for segmentation, GPT-4o for verification). The paper acknowledges that scaling to larger datasets yields no noticeable improvement, suggesting the detector may be underparameterized or the feature space saturated. Second, the 0.5 probability threshold for intervention is arbitrary; no sensitivity analysis is provided. Third, while the detector adds only 0.10% parameters, it requires access to the layer-32 hidden states of Qwen3-8B, and generalization to other architectures is untested. Finally, the decoupled training (labels from QwQ, hidden states from Qwen3-8B) creates a mismatch that could limit performance on genuine self-correction trajectories.
The evidence supports the central claim that streaming detection outperforms both RL-based and heuristic baselines. The comparison with L1 (Aggarwal & Welleck, 2025) is fair: L1 requires fine-tuning the 8B backbone on 40K samples, while ROM trains only 8.13M parameters on 1.5K samples. The efficiency metric $SE = \text{Acc}/\text{SL} \times 100$ favors ROM’s instance-adaptive control over L1’s global constraint. However, the comparison with EAT (entropy-based) reveals an apples-to-oranges issue: EAT is entirely training-free, while ROM requires supervised training. The 7-dataset evaluation is comprehensive, though MMLU-Pro results (77.10%) show smaller gains over vanilla (76.67%) than math tasks, suggesting the detector transfers less effectively to multiple-choice domains.
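As a worked example of the efficiency metric, the sketch below applies $SE = \text{Acc}/\text{SL} \times 100$ to the aggregate figures quoted in this review. It assumes Acc is in percentage points and SL is the average response length in tokens; `response_efficiency` is a hypothetical helper name. Because it uses only the aggregate averages, it is not expected to reproduce the paper's own reported efficiency improvement, which may be computed differently.

```python
def response_efficiency(acc_percent: float, avg_tokens: float) -> float:
    """SE = Acc / SL * 100, with Acc in percent and SL in tokens (assumed units)."""
    if avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return acc_percent / avg_tokens * 100


# Aggregate figures quoted in the review.
rom = response_efficiency(93.51, 1159)
vanilla = response_efficiency(91.72, 2197)

# Relative reduction in average response length, in percent.
length_reduction = (1 - 1159 / 2197) * 100

print(f"ROM SE:           {rom:.2f}")
print(f"Vanilla SE:       {vanilla:.2f}")
print(f"Length reduction: {length_reduction:.1f}%")
```

The length reduction computed this way matches the 47.2% figure stated in the abstract.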
Code is available at https://github.com/SaFo-Lab/ROM. Implementation details are thorough: layer 32 of Qwen3-8B, $d_p = 1024$ for AttnProj, and a CfC cell for temporal modeling. However, reproduction is hindered by the GPT-4o labeling dependency: the paper notes that open alternatives like Llama-3.3-70B yielded noticeably lower labeling accuracy. Pre-computing hidden states (Section 3.5) reduces training cost but requires substantial storage. Hyperparameters for the CfC cell and training (optimizer, learning rate, batch size) are omitted. The small training size (1.5K samples) aids reproducibility but raises generalization concerns for other backbones.
Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.