A transformer architecture alteration to incentivise externalised reasoning
LLMs with chain-of-thought reasoning can perform hidden internal computations across filler tokens, threatening AI safety by enabling obfuscated reasoning. This paper proposes an early-exit transformer architecture that trains models to truncate forward passes at intermediate layers when tokens are predictable, aiming to force reasoning into externalized CoT rather than internal activations. The approach uses self-distillation to calibrate exit probabilities followed by RL with a layer-depth penalty, showing on small models that adaptive depth reduction can maintain task performance while reducing computation.
This is a preliminary proof-of-concept demonstrating that early-exit mechanisms can be trained via RL to reduce per-token compute usage while maintaining task performance on small reasoning models. However, the paper falls critically short of supporting its central safety claim: it provides no evidence that the mechanism actually improves CoT monitorability or faithfulness. The authors explicitly state "We leave the evaluation of faithfulness for future work," yet frame the contribution as addressing obfuscation risks. With only a 5-10% reduction in average layer usage achieved in experiments, it remains unproven that the mechanism places any practical constraint on hidden computation.
The technical implementation is sound. The self-distillation calibration using KL divergence between intermediate and final layer logits provides a principled way to identify when early exits are safe. The RLOO variant with the explicit layer penalty term $-\lambda \cdot (\text{average exit layer index})$ correctly incentivizes earlier exits. Figure 2 demonstrates that models can indeed learn adaptive depth variation, using fewer layers for predictable tokens (like "-talk" in "walkie-talkie") while preserving full depth for complex reasoning tokens. The stochastic exit mechanism (sampling from learned distributions rather than thresholding) enables effective RL exploration.
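The calibration signal and the stochastic exit described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the direction of the KL term, the mapping from KL to an exit target (`exp(-scale * kl)`), and the names `exit_target`, `sample_exit_layer`, and `scale` are all assumptions introduced here. In particular, `scale` is hypothetical and is not claimed to be the paper's "KL factor".

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def exit_target(intermediate_logits, final_logits, scale=1.0):
    """Assumed calibration target for an intermediate layer's exit probability.

    Equals 1 when the intermediate layer's next-token distribution matches
    the final layer's, and decays towards 0 as the KL divergence grows.
    The exponential mapping is an assumption made for illustration.
    """
    kl = kl_divergence(softmax(final_logits), softmax(intermediate_logits))
    return math.exp(-scale * kl)


def sample_exit_layer(exit_probs, rng):
    """Stochastic exit: visit candidate layers in order and exit at each one
    with its own probability (sampling, not thresholding). If no exit fires,
    return len(exit_probs), standing for the full-depth final layer."""
    for layer, p in enumerate(exit_probs):
        if rng.random() < p:
            return layer
    return len(exit_probs)


final = [2.0, 0.1, -1.0]
agree = exit_target([2.0, 0.1, -1.0], final)     # intermediate matches final
disagree = exit_target([0.1, 2.0, -1.0], final)  # intermediate prefers another token

rng = random.Random(0)
forced_exit = sample_exit_layer([0.0, 1.0], rng)  # can only exit at layer 1
no_exit = sample_exit_layer([0.0, 0.0], rng)      # falls through to full depth
print(agree, disagree, forced_exit, no_exit)
```

Sampling the exit rather than thresholding it means the policy occasionally tries exits it currently considers unlikely, which is what gives the RL stage something to explore.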
First, the paper claims to address "unfaithfulness" and "obfuscation" in CoT monitoring, but never validates that early exiting actually forces externalization of reasoning rather than simply degrading the model's reasoning capability. The core safety hypothesis, that constraining internal compute pushes reasoning into monitorable tokens, remains untested. Second, the achieved compute reduction is modest: average layer usage only drops from 98% to approximately 90-95% in RL experiments, which may not constrain non-myopic planning enough to matter. Third, the mechanism assumes that shallow layers can substitute for deeper computation when tokens are predictable, but it does not rule out that models use the 'saved' computation in other ways, or that early exits harm reasoning quality in subtle ways not captured by accuracy metrics. The coherence scores sitting "slightly below the unmodified baseline" suggest some quality degradation.
The evidence supports the technical feasibility of training early-exit mechanisms but does not substantiate the safety claims. The citation of Pfau et al. 2024 regarding hidden computation in filler tokens is apt and verified: "The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations." However, the paper lacks comparison to alternative approaches for improving CoT faithfulness, such as probing-based detection of hidden states or supervised fine-tuning on faithful reasoning traces. The evaluation is limited to two domains (GSM8K and Theory of Mind) on small models (1.5B and 4B parameters), with the authors noting they "still need to run cross-domain validation" and test on larger models.
Reproducibility is significantly hampered by missing implementation details. The paper does not specify critical hyperparameters, including LoRA rank, learning rates, and batch sizes, nor the exact RLOO implementation. The KL factor values tested (0.25 to 4.0) and selected (1.0) are provided, but the beta value for KL regularization (0.25) appears only in a figure caption. No code, model weights, or dataset splits are publicly released. "Appendix A", which the paper cites for complete training details, contains only a high-level description of the loss function rather than runnable specifications. The GPT-5-based coherence evaluation (scoring four dimensions on a 1-10 scale) introduces non-determinism, and the paper reports neither the judge's temperature nor its version.
We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at intermediate layers and train the model to exit at shallower layers when the next token can be predicted without deep computation. After a calibration stage, we incentivise the model to exit as early as possible while maintaining task performance using reinforcement learning. We provide preliminary results to this effect for small reasoning models, showing that they learn to adaptively reduce computations across tokens. We predict that, applied at the right scale, our approach can minimise the amount of excess computation that reasoning models have at their disposal to perform non-myopic planning using their internal activations, reserving this only for difficult-to-predict tokens.
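The RL stage can be illustrated with a small sketch of a leave-one-out (RLOO) advantage computed on rewards that carry the layer-depth penalty $-\lambda \cdot (\text{average exit layer index})$ quoted in the review. The function names, the value `lam=0.01`, and the example rewards and exit depths are assumptions chosen for illustration. The paper's exact RLOO variant is not specified, so this should not be read as its implementation.

```python
def penalised_rewards(task_rewards, avg_exit_layers, lam=0.01):
    """Subtract lam * (average exit layer index) from each completion's
    task reward, so that deeper forward passes are penalised.
    The value of lam is hypothetical."""
    return [r - lam * depth for r, depth in zip(task_rewards, avg_exit_layers)]


def rloo_advantages(rewards):
    """Leave-one-out baseline: each completion's advantage is its reward
    minus the mean reward of the other completions for the same prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Three sampled completions for one prompt (made-up numbers):
# the first two solve the task, the third does not.
task_rewards = [1.0, 1.0, 0.0]
avg_exit_layers = [20.0, 10.0, 10.0]  # mean exit layer index per completion

rewards = penalised_rewards(task_rewards, avg_exit_layers)
advantages = rloo_advantages(rewards)
print(advantages)
```

In this toy case the two correct completions differ only in depth, and the shallower one receives the larger advantage, which is the pressure towards earlier exits that the penalty is meant to create. By construction, leave-one-out advantages sum to zero across the completions for a prompt.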