Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization
Standard Transformers apply fixed-depth computation regardless of problem difficulty, limiting their ability to solve tasks requiring variable-depth reasoning like multi-hop traversal or nested logic. This paper proposes a depth-recurrent Transformer that iteratively applies a shared-weight block in latent space—enabling 'vertical Chain-of-Thought' where models trade recurrence steps for deeper reasoning without consuming context window. The work demonstrates strong compositional generalization on three synthetic tasks and offers a mechanistic alternative to horizontal token-generation paradigms.
The paper presents a cleanly architected and empirically validated approach to variable-depth reasoning in Transformers. The core contribution—decoupling computational depth from parameter count via latent-space recurrence—is both theoretically motivated and practically demonstrated through a progression of tasks with decreasing inductive biases. However, the evaluation remains limited to small-scale synthetic settings (<1M parameters), and the abrupt collapse observed at 10 hops in graph reachability (despite perfect generalization to 8 hops) suggests fragility that warrants deeper investigation.
The three stabilization mechanisms, namely silent thinking (final-step-only supervision), LayerScale initialization ($\Gamma_i=10^{-4}$), and identity-biased recurrence ($b_z=-2.0$), are well motivated and appear effective for unrolling 20+ steps. The ablation comparing silent thinking against intermediate supervision is particularly compelling: models trained with per-step losses learn statistical shortcuts, reaching 73% accuracy on 12-hop paths after a single step. Because strict adjacency masking limits how far information can travel in one step, that accuracy cannot reflect actual traversal of the path. Silent thinking, by contrast, forces genuine algorithmic learning. The progression from topological masking to unstructured text provides a nuanced view of how perception interfaces shape generalization.
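The shortcut argument can be made concrete with a small sketch. It assumes that one recurrence step under adjacency masking moves information across at most one edge; the function name, the graph encoding, and the 12-hop chain are illustrative and not taken from the paper.

```python
def reachable_after(adj, source, steps):
    """Return the nodes reached from `source` after `steps` rounds of
    one-hop propagation (a stand-in for one masked recurrence step)."""
    reached = {source}
    for _ in range(steps):
        # Assumption: each round, information crosses at most one edge.
        frontier = {nbr for node in reached for nbr in adj.get(node, ())}
        reached |= frontier
    return reached


# Hypothetical 12-hop chain: 0 -> 1 -> ... -> 12
chain = {i: [i + 1] for i in range(12)}

print(12 in reachable_after(chain, 0, 1))   # False: one step covers one hop
print(12 in reachable_after(chain, 0, 12))  # True: twelve steps cover twelve hops
```

Under this assumption, a model that answers 12-hop queries well after one step must be relying on something other than traversal, which is the reading the ablation supports.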
The evaluation is restricted to small synthetic tasks with fewer than 1M parameters, raising questions about scalability to real-world language modeling. The abrupt collapse at 10 hops in the graph task—despite stable performance at 8 hops—remains unexplained and suggests the 'computational frontier' may be more brittle than the diagonal heatmap narrative implies. The perception interfaces are hand-engineered for each task, undermining claims about autonomous latent routing. Furthermore, the paper positions itself against Chain-of-Thought prompting but provides no direct comparison to actual LLM baselines or standard CoT performance on these tasks, making it difficult to assess practical significance.
The evidence supports the specific claims about the computational frontier phenomenon and the dangers of intermediate supervision on these synthetic tasks. However, comparisons to related work lack empirical grounding: while the paper distinguishes itself from Universal Transformers and Coconut on architectural grounds, no direct empirical baselines are provided to quantify these differences. The theoretical claims about bypassing $\mathsf{TC}^0$ limitations are cited from Merrill and Sabharwal (2024) but not formally proven for this specific recurrent architecture. The OOD generalization claims are modest—referring to longer paths within the same task distribution rather than cross-task transfer.
The architectural details are comprehensively documented in Appendix A and Table 2, including hyperparameters ($d=128/256$, $h=4/8$, LayerScale init $10^{-4}$, gate bias $-2.0$) and initialization schemes. The tasks—graph reachability, nested boolean logic, and CLUTRR-style relational composition—are synthetic and should be reproducible from the descriptions provided. However, no code repository or data generation scripts are mentioned in the provided text, which would be necessary for exact reproduction. The reliance on specific manually-designed perception interfaces for each task introduces implementation details that may not be fully recoverable from the text alone.
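To show how the reported initialization values might interact, here is a minimal sketch of one identity-biased update. The update form, the constant gate, and the toy `block` are assumptions made for illustration; only the LayerScale value $10^{-4}$ and the gate bias $-2.0$ come from the text, and the paper's exact equations may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recurrent_step(h, block, gamma=1e-4, gate_bias=-2.0):
    """One identity-biased update of the latent state `h` (a list of floats).

    Assumed form: h_next = (1 - z) * h + z * (h + gamma * block(h)),
    with a constant gate z = sigmoid(gate_bias). A trained model would
    presumably compute the gate from learned weights as well.
    """
    z = sigmoid(gate_bias)
    update = block(h)
    candidate = [x + gamma * u for x, u in zip(h, update)]
    return [(1.0 - z) * x + z * c for x, c in zip(h, candidate)]


def toy_block(h):
    # Hypothetical stand-in for the shared Transformer block.
    return [math.tanh(x) for x in h]


h0 = [1.0, -0.5]
h = h0
for _ in range(20):
    h = recurrent_step(h, toy_block)

print(round(sigmoid(-2.0), 3))                   # 0.119
print(max(abs(a - b) for a, b in zip(h, h0)))    # drift after 20 steps
```

With these values the gate starts near 0.12 and the block's contribution is scaled by $10^{-4}$, so in this sketch the state barely moves at initialization. That is consistent with the stated goal of keeping 20+ unrolled steps stable early in training.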
Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space, enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier, a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
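The difference between the silent thinking objective and intermediate supervision can be sketched as two loss functions over per-step output distributions. The function names and the toy probabilities are hypothetical; the paper's actual loss and readout are not specified in the text above.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


def silent_thinking_loss(step_probs, target):
    """Supervise only the output of the final recurrence step."""
    return cross_entropy(step_probs[-1], target)


def per_step_loss(step_probs, target):
    """Average the loss over every recurrence step (intermediate supervision)."""
    return sum(cross_entropy(p, target) for p in step_probs) / len(step_probs)


# Toy run: the model is undecided for two steps, then commits to class 1.
step_probs = [[0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]
target = 1

silent = silent_thinking_loss(step_probs, target)
dense = per_step_loss(step_probs, target)
print(silent < dense)  # True: early uncertainty is not penalized under silent thinking
```

In this sketch the per-step loss penalizes the undecided early steps, which is the pressure that could push a model toward guessing early. The final-step-only loss leaves intermediate states unconstrained.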