Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

cs.LG cs.AI cs.CL Hung-Hsuan Chen · Mar 23, 2026

AI Review

Plain-language introduction

Standard Transformers apply fixed-depth computation regardless of problem difficulty, limiting their ability to solve tasks requiring variable-depth reasoning like multi-hop traversal or nested logic. This paper proposes a depth-recurrent Transformer that iteratively applies a shared-weight block in latent space—enabling 'vertical Chain-of-Thought' where models trade recurrence steps for deeper reasoning without consuming context window. The work demonstrates strong compositional generalization on three synthetic tasks and offers a mechanistic alternative to horizontal token-generation paradigms.

Critical review

Verdict

Bottom line

The paper presents a cleanly architected and empirically validated approach to variable-depth reasoning in Transformers. The core contribution—decoupling computational depth from parameter count via latent-space recurrence—is both theoretically motivated and practically demonstrated through a progression of tasks with decreasing inductive biases. However, the evaluation remains limited to small-scale synthetic settings (<1M parameters), and the abrupt collapse observed at 10 hops in graph reachability (despite perfect generalization to 8 hops) suggests fragility that warrants deeper investigation.

“We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space—enabling the model to trade recurrence steps for deeper reasoning at inference time.”

paper · Abstract

“The model achieves 100% OOD generalization up to 8 hops (1.6×), but collapses abruptly at 10 hops, indicating a clear, rigid generalization boundary enforced by the topological masking.”

paper · Section 4.1

What holds up

The three stabilization mechanisms, silent thinking (final-step-only supervision), LayerScale initialization ($\Gamma_i=10^{-4}$), and identity-biased recurrence ($b_z=-2.0$), are well motivated and appear effective for unrolling 20+ steps. The ablation comparing silent thinking against intermediate supervision is particularly compelling: models trained with per-step losses learn statistical shortcuts, reaching 73% accuracy on 12-hop paths after a single step. Under strict adjacency masking, a one-step model has no information about nodes 12 hops away, so that score cannot come from actual traversal and must reflect dataset regularities. Silent thinking, by contrast, forces genuine algorithmic learning. The progression from topological masking to unstructured text provides a nuanced view of how perception interfaces shape generalization.

“We apply supervision only at the final recurrence step, with no intermediate auxiliary losses. This forces the model to develop genuine multi-step reasoning paths rather than learning heuristic shortcuts that satisfy per-step supervision.”

paper · Section 3.1
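
To make the three mechanisms concrete, here is a minimal NumPy sketch of a gated, shared-weight recurrence with a loss on the final state only. It is an illustration under stated assumptions, not the authors' implementation: the gate form `(1 - z) * h + z * candidate`, the `tanh(h @ W)` stand-in for the Transformer block, the mean-pooled readout, and every function name are hypothetical. Only the constants (LayerScale init $10^{-4}$, gate bias $-2.0$, 20 steps) and the final-step-only supervision come from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_step(h, W, gamma, b_z):
    """One shared-weight update; W stands in for the whole Transformer block."""
    update = np.tanh(h @ W)            # placeholder for attention + MLP (assumption)
    z = sigmoid(b_z)                   # bias-only gate; a learned gate would add a state-dependent term
    candidate = h + gamma * update     # LayerScale-damped residual branch
    return (1.0 - z) * h + z * candidate   # negative b_z biases the step toward the identity

def think(h0, W, steps, gamma=1e-4, b_z=-2.0):
    """Apply the same block `steps` times in latent space."""
    h = h0
    for _ in range(steps):
        h = recurrent_step(h, W, gamma, b_z)
    return h

def silent_thinking_loss(h_final, w_out, target):
    """Binary cross-entropy on the final state only; no per-step losses."""
    logit = float(h_final.mean(axis=0) @ w_out)
    p = sigmoid(logit)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

rng = np.random.default_rng(0)
h0 = rng.normal(size=(6, 8))               # 6 tokens, width 8 (toy sizes)
W = rng.normal(size=(8, 8)) / np.sqrt(8)
w_out = rng.normal(size=8)

h20 = think(h0, W, steps=20)
drift = np.abs(h20 - h0).max()             # how far 20 steps move the state at initialization
loss = silent_thinking_loss(h20, w_out, target=1.0)
print(f"gate={sigmoid(-2.0):.3f} drift={drift:.2e} loss={loss:.3f}")
```

With these constants each step changes the state by at most `sigmoid(-2.0) * 1e-4` per coordinate, so an untrained 20-step unroll stays very close to its input. That is one plausible reading of how the two initializations protect the state early in training.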

“Intermediate supervision exhibits an apparent anomaly: it achieves over 70% accuracy on 12-hop paths after only a single thinking step. Under strict topological masking, a 1-step model has absolutely no information about nodes 12 hops away; true accuracy must be bounded near 50%.”

paper · Section 4.4
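
The information argument in the quote above can be checked on a toy case. The sketch below assumes an undirected path graph and an attention mask that lets each node see itself and its direct neighbours; both are illustrative assumptions, and the paper's graphs and masking details may differ. It tracks which nodes can influence a source node after each masked step.

```python
import numpy as np

def receptive_field_history(adj, source, steps):
    """Return, for each step, the boolean set of nodes that can influence `source`."""
    n = adj.shape[0]
    mask = adj | np.eye(n, dtype=bool)     # self plus direct neighbours
    reach = np.zeros(n, dtype=bool)
    reach[source] = True
    history = []
    for _ in range(steps):
        reach = mask[reach].any(axis=0)    # union of the neighbourhoods of reached nodes
        history.append(reach.copy())
    return history

# Path graph 0 - 1 - ... - 12, so node 12 is exactly 12 hops from node 0.
n = 13
adj = np.zeros((n, n), dtype=bool)
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = True

history = receptive_field_history(adj, source=0, steps=12)
first_step = next(k + 1 for k, r in enumerate(history) if r[12])
print(first_step)  # 12
```

After one step the source sees only itself and node 1, and node 12 enters the receptive field at step 12. Any one-step accuracy well above chance on 12-hop queries therefore has to come from something other than traversal.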

Main concerns

The evaluation is restricted to small synthetic tasks and models with fewer than 1M parameters, raising questions about scalability to real-world language modeling. The abrupt collapse at 10 hops in the graph task, despite perfect accuracy at 8 hops, is attributed by the authors to topological masking but is not otherwise explained, and it suggests the 'computational frontier' may be more brittle than the diagonal heatmap narrative implies. The perception interfaces are hand-engineered for each task, undermining claims about autonomous latent routing. Furthermore, the paper positions itself against Chain-of-Thought prompting but provides no direct comparison to actual LLM baselines or standard CoT performance on these tasks, making it difficult to assess practical significance.

“The model achieves 100% OOD generalization up to 8 hops (1.6×), but collapses abruptly at 10 hops.”

paper · Section 4.1

“We acknowledge several limitations. First, we use relatively small models ($\ll$1M parameters). Second, the perception interfaces are designed manually.”

paper · Section 5 (Conclusion)

Evidence and comparison

The evidence supports the specific claims about the computational frontier phenomenon and the dangers of intermediate supervision on these synthetic tasks. However, comparisons to related work lack empirical grounding: while the paper distinguishes itself from Universal Transformers and Coconut on architectural grounds, no direct empirical baselines are provided to quantify these differences. The theoretical claims about bypassing $\mathsf{TC}^0$ limitations are cited from Merrill and Sabharwal (2024) but not formally proven for this specific recurrent architecture. The OOD generalization claims are modest—referring to longer paths within the same task distribution rather than cross-task transfer.

“Our work builds on this foundation but differs in several critical aspects. First, we use final-step-only supervision (silent thinking) rather than per-step losses, which we show empirically avoids heuristic shortcut learning.”

paper · Section 2.3

“By making depth dynamically variable through recurrence, our architecture natively bypasses the $\mathsf{TC}^0$ limitation.”

paper · Section 2.4
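
The depth argument behind this quote can be written out in a few lines. The notation below ($\mathrm{Enc}$, $\mathrm{Dec}$, $F_\theta$, $T$, $L$) is introduced here for illustration and is not taken from the paper. It restates the architectural claim and is not a proof of any complexity separation.

$$
h^{(0)} = \mathrm{Enc}(x), \qquad h^{(t)} = F_\theta\big(h^{(t-1)}\big) \quad (t = 1, \dots, T), \qquad \hat{y} = \mathrm{Dec}\big(h^{(T)}\big)
$$

$$
\text{sequential depth} \;\propto\; T, \qquad \text{parameter count} = |\theta| \ \text{(independent of } T\text{)}
$$

A standard Transformer with $L$ layers has depth fixed at $L$ regardless of the input, which is the regime the cited $\mathsf{TC}^0$ results address. Here $T$ can be chosen at inference time and can grow with task complexity. Whether this particular gated recurrence inherits the formal guarantees is the open point raised above.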

Reproducibility

The architectural details are comprehensively documented in Appendix A and Table 2, including hyperparameters ($d=128/256$, $h=4/8$, LayerScale init $10^{-4}$, gate bias $-2.0$) and initialization schemes. The tasks—graph reachability, nested boolean logic, and CLUTRR-style relational composition—are synthetic and should be reproducible from the descriptions provided. However, no code repository or data generation scripts are mentioned in the provided text, which would be necessary for exact reproduction. The reliance on specific manually-designed perception interfaces for each task introduces implementation details that may not be fully recoverable from the text alone.

“Graph: d=128, h=4, d_ff=256, LayerScale=×; Nested Expr.: d=256, h=8, d_ff=1024, LayerScale=✓; Gate bias b_z=-2.0 for all.”

paper · Appendix A, Table 2
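
For anyone attempting a reimplementation, the quoted Table 2 values can be collected in a small configuration object. This is a hypothetical sketch: the class and field names are invented here, and only the numbers come from the quote. The configuration for the relational text task is not given in the quoted excerpt, so it is left out rather than guessed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecurrentConfig:
    d_model: int                    # d in Table 2
    n_heads: int                    # h in Table 2
    d_ff: int                       # feed-forward width
    use_layerscale: bool            # the x / check mark column in Table 2
    layerscale_init: float = 1e-4   # applies only when use_layerscale is True
    gate_bias: float = -2.0         # b_z, shared by all tasks

    @property
    def head_dim(self) -> int:
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

CONFIGS = {
    "graph": RecurrentConfig(d_model=128, n_heads=4, d_ff=256, use_layerscale=False),
    "nested_expr": RecurrentConfig(d_model=256, n_heads=8, d_ff=1024, use_layerscale=True),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.head_dim, cfg.gate_bias)
```

Both quoted configurations work out to a per-head width of 32, which follows directly from $d/h$.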

“Third, we do not provide formal theoretical guarantees on the generalization bound; our evidence is empirical.”

paper · Section 5 (Conclusion)

Abstract

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space -- enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier -- a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
