A transformer architecture alteration to incentivise externalised reasoning

cs.AI Elizabeth Pavlova, Mariia Koroliuk, Karthik Viswanathan, Cameron Tice, Edward James Young, Puria Radmard · Mar 22, 2026

AI Review

Plain-language introduction

LLMs with chain-of-thought reasoning can perform hidden internal computations across filler tokens, threatening AI safety by enabling obfuscated reasoning. This paper proposes an early-exit transformer architecture that trains models to truncate forward passes at intermediate layers when tokens are predictable, aiming to force reasoning into externalized CoT rather than internal activations. The approach uses self-distillation to calibrate exit probabilities followed by RL with a layer-depth penalty, showing on small models that adaptive depth reduction can maintain task performance while reducing computation.

Critical review

Verdict

Bottom line

This is a preliminary proof of concept showing that early-exit mechanisms can be trained via RL to reduce per-token compute while maintaining task performance on small reasoning models. However, the paper falls critically short of its central safety claim: it provides no evidence that the mechanism actually improves CoT monitorability or faithfulness. The authors explicitly state "We leave the evaluation of faithfulness for future work," yet frame the contribution as addressing obfuscation risks. With average layer usage falling only from 98% to roughly 90-95% in the experiments, it remains unproven that the mechanism places any practical constraint on hidden computation.

“We stop short of verifying the intended effects on CoT monitorability.”

Pavlova et al. · Section 3

“We leave the evaluation of faithfulness for future work.”

Pavlova et al. · Section 1

What holds up

The technical implementation is sound. The self-distillation calibration using KL divergence between intermediate and final layer logits provides a principled way to identify when early exits are safe. The RLOO variant with the explicit layer penalty term $-\lambda \cdot (\text{average exit layer index})$ correctly incentivizes earlier exits. Figure 2 demonstrates that models can indeed learn adaptive depth variation, using fewer layers for predictable tokens (like "-talk" in "walkie-talkie") while preserving full depth for complex reasoning tokens. The stochastic exit mechanism (sampling from learned distributions rather than thresholding) enables effective RL exploration.

“$R_{\text{total}}=R_{\text{task}}-\lambda\cdot(\text{average exit layer index})-\beta\cdot D_{\text{KL}}(\text{policy}\,\|\,\text{base model})$”

Pavlova et al. · Section 2, Eq. 1
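
To make Eq. 1 concrete, here is a minimal Python sketch of the reward together with a leave-one-out baseline in the style of RLOO. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, dividing the exit layer index by the layer count is an assumption (the quoted equation does not say whether the index is raw or normalised), and the defaults for $\lambda$ and $\beta$ are the values quoted from the Figure 2 caption.

```python
from statistics import mean


def total_reward(task_reward, exit_layers, num_layers, kl_to_base, lam=1.5, beta=0.25):
    """R_total = R_task - lam * (average exit layer index) - beta * KL(policy || base).

    ASSUMPTION: the exit layer index is normalised by num_layers, so the
    depth penalty is on the same scale as a task reward in [0, 1].
    """
    avg_exit_depth = mean(exit_layers) / num_layers
    return task_reward - lam * avg_exit_depth - beta * kl_to_base


def leave_one_out_advantages(rewards):
    """Compare each sampled completion with the mean reward of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two sampled completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Two equally correct completions for one prompt (hypothetical numbers);
# the first exits earlier on average, the second always runs all 10 layers.
early = total_reward(1.0, exit_layers=[9, 9, 10, 8], num_layers=10, kl_to_base=0.0)
full = total_reward(1.0, exit_layers=[10, 10, 10, 10], num_layers=10, kl_to_base=0.0)
advantages = leave_one_out_advantages([early, full])
```

With equal task reward and equal KL, the completion that exits earlier receives the positive advantage, which is the incentive the review describes. Nothing in this reward checks whether the reasoning it pushes out of the forward pass reappears in the visible tokens.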

“The model adaptively varies computational depth, using fewer layers for predictable tokens and the full depth for more complex ones.”

Pavlova et al. · Section 3, Figure 2 caption
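
The calibration and stochastic-exit ideas praised above can be sketched in a few lines of standard-library Python. This is a hedged illustration only: the direction of the KL divergence, the mapping `exp(-kl_factor * KL)` from divergence to exit probability, and all function names are assumptions made for the example, since the review does not reproduce the paper's exact formulation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def exit_probability(intermediate_logits, final_logits, kl_factor=1.0):
    """ASSUMED mapping: exit is likely when the intermediate layer's
    next-token distribution already matches the final layer's."""
    p_final = softmax(final_logits)
    q_mid = softmax(intermediate_logits)
    return math.exp(-kl_factor * kl_divergence(p_final, q_mid))


def sample_exit_layer(exit_probs, rng):
    """Walk the candidate exit points in order and sample, rather than
    threshold, whether to stop. Returns len(exit_probs) for full depth."""
    for layer, p in enumerate(exit_probs):
        if rng.random() < p:
            return layer
    return len(exit_probs)


agree = exit_probability([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
disagree = exit_probability([0.0, 0.0, 5.0], [5.0, 0.0, 0.0])
chosen = sample_exit_layer([0.0, 1.0], random.Random(0))
```

Under this assumed mapping a smaller `kl_factor` makes exits more likely at the same divergence, which matches the direction of the trade-off the review quotes from Table 1. Sampling the exit rather than thresholding it is what gives RL a distribution over depths to explore.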

Main concerns

First, the paper claims to address "unfaithfulness" and "obfuscation" in CoT monitoring, but never validates that early exiting actually forces reasoning to be externalized rather than simply degrading the model's reasoning capability. The core safety hypothesis, that constraining internal compute pushes reasoning into monitorable tokens, remains untested. Second, the achieved compute reduction is modest: average layer usage drops only from 98% to approximately 90-95% in the RL experiments, which may be too small to meaningfully constrain non-myopic planning. Third, the mechanism assumes that shallow layers can substitute for deeper computation when tokens are predictable, but the paper does not rule out that models use the 'saved' computation in other ways, or that early exits harm reasoning quality in subtle ways that accuracy metrics miss. The coherence scores sitting "slightly below the unmodified baseline" suggest some quality degradation.

“Obfuscated reasoning is facilitated by the fact that transformers are non-myopic: each forward pass is optimised not just for the immediate next token, but for subsequent predictions”

Pavlova et al. · Section 1

“Average layer compute decreases from 98% to approximately 95% with some runs dropping to a 90% average”

Pavlova et al. · Section 3
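
A small worked example shows how little a 95% average can constrain. The sketch below assumes that "average layer compute" means the mean fraction of layers executed per token. That definition, the 28-layer depth, and the function name are all assumptions made for illustration.

```python
def average_layer_compute(exit_layers, num_layers):
    """Mean fraction of the layer stack executed per generated token."""
    if not exit_layers:
        raise ValueError("need at least one token")
    return sum(exit_layers) / (len(exit_layers) * num_layers)


NUM_LAYERS = 28  # hypothetical depth, not taken from the paper

# Nine tokens run the full stack; one token exits at half depth.
layers_used = [NUM_LAYERS] * 9 + [NUM_LAYERS // 2]
share = average_layer_compute(layers_used, NUM_LAYERS)  # about 0.95
```

Under these assumptions, a 95% average is reached when nine tokens in ten still run at full depth, so the headline figure alone says little about how much internal computation remains available on the tokens that matter.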

“Answer coherence is stable across RL training but sits slightly below the unmodified baseline”

Pavlova et al. · Section 3

Evidence and comparison

The evidence supports the technical feasibility of training early-exit mechanisms but does not substantiate the safety claims. The citation of Pfau et al. 2024 regarding hidden computation in filler tokens is apt and verified: "The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations." However, the paper lacks comparison to alternative approaches for improving CoT faithfulness, such as probing-based detection of hidden states or supervised fine-tuning on faithful reasoning traces. The evaluation is limited to two domains (GSM8K and Theory of Mind) on small models (1.5B and 4B parameters), with the authors noting they "still need to run cross-domain validation" and test on larger models.

“Our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.”

Pfau et al. · arXiv:2404.15758, Abstract

“Our preliminary evaluations have been on within-task performance for GSM8K and theory of mind data and we still need to run cross-domain validation to evaluate generalisation of our early-exit architecture”

Pavlova et al. · Section 4

Reproducibility

Reproducibility is significantly hampered by missing implementation details. The paper does not specify critical hyperparameters such as LoRA rank, learning rates, or batch sizes, nor the exact RLOO implementation. The KL factor values tested (0.25 to 4.0) and selected (1.0) are provided, but the $\beta$ value for KL regularization (0.25) appears only in a figure caption. No code, model weights, or dataset splits are publicly released. "Appendix A", which the paper cites for complete training details, offers only high-level descriptions of the loss function rather than runnable specifications. The GPT-5-based coherence evaluation (four dimensions, each scored 1-10) introduces non-determinism, and the paper reports neither the judge's temperature nor its version.

“We used $\lambda=1.5$ and $\beta=0.25$”

Pavlova et al. · Figure 2 caption

“KL factor 0.25 achieves 59% early exit rate but coherence score of 1.1 versus base model 8.9”

Pavlova et al. · Appendix A.1, Table 1

Abstract

We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at intermediate layers and train the model to exit at shallower layers when the next token can be predicted without deep computation. After a calibration stage, we incentivise the model to exit as early as possible while maintaining task performance using reinforcement learning. We provide preliminary results to this effect for small reasoning models, showing that they learn to adaptively reduce computations across tokens. We predict that, applied at the right scale, our approach can minimise the amount of excess computation that reasoning models have at their disposal to perform non-myopic planning using their internal activations, reserving this only for difficult-to-predict tokens.
