Recursive Language Models

cs.AI cs.CL · Alex L. Zhang, Tim Kraska, Omar Khattab · Dec 31, 2025

AI Review

Plain-language introduction

Recursive Language Models (RLMs) tackle the long-context problem by treating prompts as external environment variables that an LLM can programmatically manipulate through a REPL. Instead of feeding long prompts directly into the neural network, RLMs use symbolic code execution to decompose, filter, and recursively invoke sub-models over prompt snippets. This allows processing inputs up to 10M+ tokens—two orders of magnitude beyond typical context windows—while maintaining strong performance on complex aggregation tasks.
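The mechanism can be illustrated with a toy sketch. This is not the authors' implementation: `run_rlm_step`, `toy_sub_llm`, `ROOT_CODE`, and the `context` / `llm_query` / `final_answer` names are all hypothetical, and the code that a root model would normally generate is hard-coded here. The sub-model is a stub that counts lines, so the example runs without any LLM.

```python
import contextlib
import io
from typing import Callable


def run_rlm_step(prompt: str, root_code: str, sub_llm: Callable[[str], str]) -> str:
    """Execute one piece of root-model code against a prompt held as a variable.

    The prompt is never pasted into the root model's context; the code only
    reaches it through the `context` name in the REPL namespace.
    """
    namespace = {"context": prompt, "llm_query": sub_llm}
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        # Toy REPL only: never exec untrusted model output without a sandbox.
        exec(root_code, namespace)
    # The answer is assembled in a variable instead of being generated inline.
    return str(namespace.get("final_answer", stdout.getvalue()))


def toy_sub_llm(snippet_prompt: str) -> str:
    """Hypothetical stand-in for a sub-model call: counts lines containing 'error'."""
    return str(sum("error" in line for line in snippet_prompt.splitlines()))


# Code a root model might plausibly emit (an assumption, hard-coded for illustration):
# split the prompt into chunks, query the sub-model per chunk, aggregate symbolically.
ROOT_CODE = """
lines = context.splitlines()
chunks = [lines[i:i + 100] for i in range(0, len(lines), 100)]
partials = [int(llm_query("\\n".join(chunk))) for chunk in chunks]
final_answer = sum(partials)
"""

long_prompt = "\n".join(
    f"line {i}: error" if i % 10 == 0 else f"line {i}: ok" for i in range(1000)
)
answer = run_rlm_step(long_prompt, ROOT_CODE, toy_sub_llm)
print(answer)
```

In a real system the stub would be replaced by a model call, and that call could itself be another RLM, which is where the recursion comes from. The root model only ever sees the short code string and whatever it chooses to print, not the full prompt.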

Critical review

Verdict

Bottom line

The paper presents a well-motivated and empirically promising inference paradigm for scaling LLM context windows. The core insight, that offloading prompts to a REPL environment enables symbolic recursion, yields substantial improvements on challenging multi-hop and aggregation tasks. However, several points require scrutiny: the reported training efficiency (a 28.3% average gain from 1,000 samples), the small scale of the novel OOLONG-Pairs benchmark (only 20 queries), and an inconsistency in the reported context length for Qwen3-8B (32K in Appendix C vs 128K in the official tech report). Additionally, cost comparisons rely on asymmetric model choices (GPT-5-mini for sub-calls vs GPT-5 for baselines) that complicate fairness assessments.

“RLM-Qwen3-8B, a Qwen3-8B model that we fine-tuned on RLM(Qwen3-Coder-480B-A35B) trajectories... considerably outperforms the base Qwen3-8B as an RLM by 28.3% on average”

Zhang et al., RLM paper · Section 4, Table 1

What holds up

The theoretical framing of RLMs as providing enhanced expressive power through symbolic recursion is well-articulated. The comparison between Algorithm 1 (RLM) and Algorithm 2 (naive scaffold) clearly isolates three critical design choices: symbolic prompt handles, output construction through variables, and programmatic recursion. The empirical results on BrowseComp-Plus are particularly striking: RLM(GPT-5) achieves 91.3% accuracy versus 0% for the base model and 70.5% for the summary agent. The cost analysis showing comparable median costs with higher variance is transparent about the trade-offs. The emergence of coherent chunking strategies (e.g., "RLM(Qwen3-Coder) chunks by newline in a 1000+ line context from OOLONG") demonstrates that the approach yields interpretable behaviors.

“An RLM must give the underlying LLM $\mathcal{M}$ a symbolic handle to the user prompt $P$, so the model can manipulate it without copying text into the root context window”

Zhang et al., Sec. 2 · Algorithm 1 vs Algorithm 2

“RLM(GPT-5): 91.3% ($0.99\pm$1.22) vs Base GPT-5: 0.0% (N/A)”

Zhang et al., Table 1 · BrowseComp+ results

Main concerns

Four issues undermine confidence in the claims. First, the training data correction process for RLM-Qwen3-8B is underspecified: the paper reports that "16% of turns cleaned incorrectly used FINAL answers, and 13% of turns incorrectly called a variable from the REPL", yet the exact correction logic is not provided. Second, the novel OOLONG-Pairs benchmark contains only 20 queries, which is too few to support robust statistical claims. Third, the context length for Qwen3-8B is reported as ~32K tokens in Appendix C but 128K in the official Qwen3 technical report (Table 1), creating confusion about baseline capabilities. Fourth, the cost comparison is asymmetric: the reported RLM cost ($0.99 on BrowseComp+) reflects GPT-5-mini sub-calls, while the baselines run full GPT-5. This makes the RLM look more cost-effective than it would if sub-calls used the same base model.

“We added an extra programmatic fixing step to look for common templated mistakes and patch them”

Zhang et al., Appendix A · Training details

“Qwen3-8B... Context Length: 128K”

Qwen Team, Qwen3 Technical Report · Table 1

“We modify the trec_coarse split of OOLONG to include 20 new queries”

Zhang et al., Appendix D.1 · OOLONG-Pairs
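To make the sample-size concern concrete, the sketch below computes a Wilson score interval for a proportion measured on 20 items. It is an approximation only: OOLONG-Pairs is scored with F1, so treating each query as pass/fail is a simplifying assumption, and the 12-of-20 outcome is a made-up example, not a number from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical outcome: 12 of 20 queries counted as correct (not from the paper).
low, high = wilson_interval(12, 20)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans close to 40 percentage points, so moderate differences between methods on a 20-query benchmark are hard to distinguish from noise.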

Evidence and comparison

The evidence supports the core claim that RLMs scale context processing, but comparisons warrant scrutiny. The RLM(GPT-5) vs base model comparison on OOLONG-Pairs (58.0% vs 0.1% F1) is compelling, though the near-total baseline failure suggests the task may be artificially difficult for single-call models. The CodeAct baseline with BM25 performs surprisingly poorly on BrowseComp+ (51.0% vs 91.3%), which raises the question of whether the gap reflects baseline implementation quality rather than conceptual superiority. The 'no sub-calls' ablation shows that the REPL alone provides substantial benefits and even outperforms the full RLM on CodeQA for Qwen3-Coder (66.0% vs 56.0%), suggesting the recursive aspect may be less critical for some tasks. Related-work comparisons are generally fair, though THREAD, ReDel, and Context Folding are mentioned as related approaches without direct empirical comparison.

“RLM(GPT-5): 58.0% F1 vs Base GPT-5: 0.1% F1”

Zhang et al., Table 1 · OOLONG-Pairs results

“RLM (no sub-calls) with Qwen3-Coder: 66.0% vs RLM: 56.0% on CodeQA”

Zhang et al., Table 1 · Ablation study

Reproducibility

Reproducibility has mixed prospects. Positives: (1) Code is available at github.com/alexzhang13/rlm; (2) The REPL system prompts are provided in full in Appendix C (including base64-encoded diffs); (3) BrowseComp-Plus is a public benchmark with verified documents. Blockers: (1) GPT-5 and GPT-5-mini are proprietary with limited reproducibility; (2) The 'programmatic correction step' for training data is not described with sufficient detail to replicate; (3) No dataset of the 1,000 training trajectories is released, and 'LongBenchPro' tasks used for training are not described with version or date information; (4) No hyperparameters for the fine-tuning (learning rate, learning rate schedule, warmup steps) are provided beyond 'batch size of 64 for 300 training steps'; (5) The random seeds for evaluation and exact prompt templates for sub-calls are not specified.

“We used a batch size of 64 for 300 training steps, training for 48 H100 hours”

Zhang et al., Appendix A · Training details

“The system prompt for RLM with REPL for GPT-5... [base64 encoded content follows]”

Zhang et al., Appendix C · System prompts
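A quick back-of-the-envelope calculation shows why the missing training details matter. The batch size, step count, and GPU hours below are the figures quoted above. The assumption that each of the 1,000 samples corresponds to exactly one training sequence is mine and may not hold (for example, if training is done per turn rather than per trajectory).

```python
batch_size = 64        # reported in Appendix A
training_steps = 300   # reported in Appendix A
gpu_hours = 48         # reported H100 hours

# Total sequences consumed by the optimizer over the whole run.
sequences_processed = batch_size * training_steps

# Assumption (not stated in the paper): one training sequence per sample.
assumed_dataset_size = 1_000
implied_passes = sequences_processed / assumed_dataset_size

# Rough throughput implied by the reported compute budget.
sequences_per_gpu_hour = sequences_processed / gpu_hours

print(sequences_processed, implied_passes, sequences_per_gpu_hour)
```

If the one-sequence-per-sample assumption is wrong, the implied number of passes changes accordingly, which is exactly the kind of ambiguity that a released dataset and a full hyperparameter list would resolve.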

Abstract

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
