ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

cs.LG cs.AI cs.CL Xinyan Wang, Xiaogeng Liu, Chaowei Xiao · Mar 23, 2026
Plain-language introduction

ROM tackles overthinking in Large Reasoning Models, where models generate redundant reasoning after reaching correct answers. The core idea is a lightweight streaming detector (an 8.13M-parameter head attached to late-layer hidden states of a frozen LLM) that predicts overthinking probability token by token and triggers early stopping. It matters because it promises a 47% token reduction without full model retraining. We find the method empirically effective but note concerns regarding data scaling limits and labeling costs.

Critical review
Verdict
Bottom line

ROM presents a sound technical contribution in formulating overthinking mitigation as a streaming prediction problem. The approach is well-motivated: instead of expensive RL-based training or brittle entropy heuristics, it learns from correctness-boundary supervision anchored at the First Correct Solution (FCS). The Counterfactual Self-Correction (CSC) augmentation effectively addresses the first-solution bias in distilled training data. However, the training set is small (1,533 samples), and the paper admits that performance does not scale noticeably with more data: training on 50% of the set performs comparably to training on the full set. This is a concerning limitation for generalization. The reliance on GPT-4o for labeling also creates a reproducibility barrier.

“CSC improves accuracy from 92.53% to 92.65% (+0.12%), reduces response length from 1421 to 1209 tokens (-14.9%), and boosts efficiency from SE=9.14 to SE=11.56 (+26.5%)”
ROM paper · Section 4.3.1
“Our method does not exhibit a pronounced scaling law with respect to training data size: performance on 50% of the training set is comparable to that on the full dataset”
ROM paper · Section 5
What holds up

The streaming detection paradigm is the strongest contribution. By monitoring late-layer hidden states with a CfC recurrent cell and attention-pooled prefix summaries, ROM captures phase changes in reasoning that entropy-based methods miss. The boundary-aware intervention with backtracing (rewinding to clean sentence boundaries) is crucial—without it, mid-sentence cuts cause the model to generate compensatory explanations, negating savings. The empirical gains are substantial: 93.51% accuracy with 1,159 tokens versus vanilla’s 91.72% with 2,197 tokens, and a 52.7% efficiency margin over the RL-based L1 baseline.

“Naively cutting at $t^{*}$ can break formatting or truncate mid-sentence. We therefore backtrace to the nearest clean boundary”
ROM paper · Section 3.4
“ROM$_{\text{CSC}}$ achieves comparable accuracy (93.51% vs. 93.47%) but with 37.1% shorter responses (1159 vs. 1843 tokens) and 97.6% higher efficiency (SE=12.37 vs. 6.26)”
ROM paper · Section 4.2.3
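The backtracing step quoted above can be illustrated with a minimal sketch. It operates on characters rather than tokens, and the boundary pattern (sentence-ending punctuation followed by whitespace, or a newline) and the function name `backtrace_to_boundary` are our assumptions for illustration; the paper's actual boundary rules are not reproduced in this review.

```python
import re

# Assumed definition of a "clean boundary": ., !, or ? followed by whitespace,
# or a newline. Decimal points such as "3.14" are not treated as boundaries.
BOUNDARY = re.compile(r"[.!?](?=\s)|\n")


def backtrace_to_boundary(text: str, cut: int) -> str:
    """Truncate `text` at the last clean boundary at or before index `cut`.

    Falls back to the raw cut if no boundary precedes it.
    """
    last_end = 0
    for match in BOUNDARY.finditer(text):
        if match.end() > cut:
            break  # this boundary lies beyond the detector's stop point
        last_end = match.end()
    return text[:last_end] if last_end else text[:cut]


trace = "So x = 4. Wait, let me double-check by subst"
print(backtrace_to_boundary(trace, len(trace)))  # prints "So x = 4."
```

Cutting at the raw stop point would leave the dangling fragment "Wait, let me double-check by subst", which is the kind of mid-sentence truncation the paper says it avoids.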
Main concerns

First, the training data is minimal (740 efficient + 793 overthinking samples, 1,533 in total) and derived from a single labeling pipeline (QwQ for segmentation, GPT-4o for verification). The paper acknowledges that training on 50% of the data performs comparably to training on the full set, which suggests the detector may be underparameterized or the feature space saturated. Second, the 0.5 probability threshold for intervention appears arbitrary; no sensitivity analysis is provided. Third, while the detector adds only 0.10% parameters, it requires access to the layer-32 hidden states of Qwen3-8B, and generalization to other architectures is untested. Finally, the decoupled training (labels from QwQ, hidden states from Qwen3-8B) creates a mismatch that could limit performance on genuine self-correction trajectories.

“Dependence on Labeling Model Quality. Our data generation pipeline relies on a (relatively) high-capability model for automatic solution labeling”
ROM paper · Section 5
“We train the detection head for 20 epochs on 740 efficient and 793 overthinking samples”
ROM paper · Section 3.5
“We trigger intervention at the first step where the predicted overthinking probability exceeds 0.5”
ROM paper · Section 3.4
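To make the threshold concern concrete, here is a minimal sketch of the trigger rule from the quote above, plus the kind of sweep a sensitivity analysis would need. The probability sequence is made up for illustration and is not from the paper; `first_trigger` and `threshold_sweep` are hypothetical names.

```python
from typing import Dict, Optional, Sequence


def first_trigger(probs: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the first step whose overthinking probability exceeds `threshold`."""
    for step, p in enumerate(probs):
        if p > threshold:
            return step
    return None  # the detector never fires; generation runs to completion


def threshold_sweep(
    probs: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, Optional[int]]:
    """Map each candidate threshold to the step at which it would intervene."""
    return {tau: first_trigger(probs, tau) for tau in thresholds}


# Illustrative per-token detector outputs (not real data).
probs = [0.05, 0.10, 0.35, 0.55, 0.48, 0.72, 0.91]
print(threshold_sweep(probs, [0.3, 0.5, 0.7, 0.95]))
```

On this toy sequence the stop point moves from step 2 to step 5 as the threshold rises from 0.3 to 0.7, and a threshold of 0.95 never fires. Reporting accuracy and length across such a sweep would show how sensitive the results are to the choice of 0.5.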
Evidence and comparison

The evidence supports the central claim that streaming detection outperforms both RL-based and heuristic baselines. The comparison with L1 (Aggarwal & Welleck, 2025) is fair: L1 requires fine-tuning the 8B backbone on 40K samples, while ROM trains only 8.13M parameters on 1.5K samples. The efficiency metric $SE = \text{Acc}/\text{SL} \times 100$ favors ROM’s instance-adaptive control over L1’s global constraint. However, the comparison with EAT (entropy-based) reveals an apples-to-oranges issue: EAT is entirely training-free, while ROM requires supervised training. The seven-benchmark evaluation is comprehensive, though MMLU-Pro results (77.10%) show smaller gains over vanilla (76.67%, a difference of 0.43 points) than math tasks, suggesting the detector transfers less effectively to multiple-choice domains.

“L1 uses RL on 40K samples to fine-tune Qwen3-8B. ROM$_{\text{CSC}}$ keeps the backbone frozen and trains only a lightweight detection head on 1,533 samples”
ROM paper · Section 4.2.2
“We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints”
L1 paper · Abstract
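The percentage figures quoted in this review can be re-derived from the raw numbers with a short check. The helper names `relative_change` and `se` are ours; `se` simply applies the formula $SE = \text{Acc}/\text{SL} \times 100$ as written above.

```python
def relative_change(old: float, new: float) -> float:
    """Signed percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


def se(acc_pct: float, mean_tokens: float) -> float:
    """Efficiency as accuracy (in %) per token, scaled by 100."""
    return acc_pct / mean_tokens * 100.0


checks = {
    "length, ROM vs vanilla (2197 -> 1159)": relative_change(2197, 1159),
    "length, with CSC (1421 -> 1209)": relative_change(1421, 1209),
    "SE, with CSC (9.14 -> 11.56)": relative_change(9.14, 11.56),
    "length, Section 4.2.3 (1843 -> 1159)": relative_change(1843, 1159),
    "SE, Section 4.2.3 (6.26 -> 12.37)": relative_change(6.26, 12.37),
}
for name, value in checks.items():
    print(f"{name}: {value:+.1f}%")

# Formula applied directly to the averaged accuracy and length.
print(f"SE from averages: {se(93.51, 1159):.2f}")
```

The relative changes reproduce the quoted -47.2%, -14.9%, +26.5%, -37.1%, and +97.6%. Applying the formula directly to the averaged accuracy and length gives about 8.07 rather than the reported 12.37; one possible explanation is that SE is computed per benchmark and then averaged, but that is our guess and is not stated in the excerpts quoted here.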
Reproducibility

Code is available at https://github.com/SaFo-Lab/ROM. Implementation details are thorough: layer 32 of Qwen3-8B, dp=1024 for AttnProj, CfC cell for temporal modeling. However, reproducing the training data is hindered by the GPT-4o labeling dependency: the paper notes that open alternatives like Llama-3.3-70B yielded noticeably lower labeling accuracy. Pre-computing hidden states (Section 3.5) reduces training cost but requires substantial storage. Hyperparameters for the CfC cell and training (optimizer, learning rate, batch size) are omitted. The small training size (1.5K samples) aids reproducibility but raises generalization concerns for other backbones.

“Code is available at https://github.com/SaFo-Lab/ROM”
ROM paper · Abstract
“We initially explored open-source alternatives such as Llama-3.3-70B-Instruct for labeling, but found its labeling accuracy to be noticeably lower than that of GPT-4o”
ROM paper · Section 5
“To reduce compute, we pre-compute and cache $\{\mathbf{h}_{t}\}$ for all training samples, so detector training does not require repeated backbone forward passes”
ROM paper · Section 3.5
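A back-of-envelope estimate gives a sense of scale for the cached hidden states. Only the sample count (1,533) comes from the paper; the mean trace length of 2,000 tokens, the hidden size of 4,096 for Qwen3-8B, and 2 bytes per value (16-bit floats) are our assumptions, so treat the result as an order-of-magnitude figure only.

```python
def cache_size_gb(
    num_samples: int,
    mean_tokens: int,
    hidden_size: int,
    bytes_per_value: int,
) -> float:
    """Storage for one layer's hidden states across all samples, in decimal GB."""
    total_bytes = num_samples * mean_tokens * hidden_size * bytes_per_value
    return total_bytes / 1e9


estimate = cache_size_gb(
    num_samples=1533,    # from the paper: 740 efficient + 793 overthinking
    mean_tokens=2000,    # assumption: average trace length
    hidden_size=4096,    # assumption: Qwen3-8B hidden dimension
    bytes_per_value=2,   # assumption: 16-bit storage
)
print(f"{estimate:.1f} GB")  # prints "25.1 GB"
```

Under these assumptions a single cached layer fits on one commodity disk, and the figure scales linearly with the number of samples and the trace length.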
Abstract

Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
