Recursive Language Models
Recursive Language Models (RLMs) tackle the long-context problem by treating prompts as external environment variables that an LLM can programmatically manipulate through a REPL. Instead of feeding long prompts directly into the neural network, RLMs use symbolic code execution to decompose, filter, and recursively invoke sub-models over prompt snippets. This allows processing inputs up to 10M+ tokens—two orders of magnitude beyond typical context windows—while maintaining strong performance on complex aggregation tasks.
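The decomposition described above can be illustrated with a minimal sketch. It is not the paper's implementation: in an RLM the model itself writes the code that runs in the REPL, whereas here the strategy is hard-coded, only one level of sub-calls is shown, and `toy_sub_llm`, `chunk_by_lines`, and `count_over_chunks` are hypothetical names introduced for illustration. The point is only that the long prompt stays in an ordinary variable, small snippets are passed to sub-calls, and the final answer is assembled in code.

```python
from typing import Callable, List


def chunk_by_lines(text: str, lines_per_chunk: int) -> List[str]:
    """Split a long context into chunks of at most `lines_per_chunk` lines."""
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def count_over_chunks(context: str, sub_llm: Callable[[str], str],
                      lines_per_chunk: int = 100) -> int:
    """Answer a counting query without ever passing the full context to a model.

    The context is held as a plain variable; each sub-call only sees one chunk,
    and the partial results are aggregated symbolically (here, by summation).
    """
    partial_counts = []
    for chunk in chunk_by_lines(context, lines_per_chunk):
        reply = sub_llm("Count the lines containing ERROR:\n" + chunk)
        partial_counts.append(int(reply))
    return sum(partial_counts)


def toy_sub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    # Drop the instruction line and count matching lines in the snippet.
    snippet = prompt.split("\n", 1)[1] if "\n" in prompt else ""
    return str(sum("ERROR" in line for line in snippet.splitlines()))


# A synthetic 1,000-line "long prompt": every 7th line is an error.
context = "\n".join(
    f"line {i}: {'ERROR' if i % 7 == 0 else 'ok'}" for i in range(1000)
)
chunks = chunk_by_lines(context, 100)
total = count_over_chunks(context, toy_sub_llm, lines_per_chunk=100)
print(len(chunks), total)
```

Swapping `toy_sub_llm` for a real model client would leave the control flow unchanged, which is the sense in which the prompt is treated as an environment variable rather than as model input.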
The paper presents a well-motivated and empirically promising inference paradigm for scaling LLM context windows. The core insight—that offloading prompts to a REPL environment enables symbolic recursion—yields substantial improvements on challenging multi-hop and aggregation tasks. However, several claims require scrutiny: the training efficiency (28.3% gain from 1,000 samples), the small scale of the novel OOLONG-Pairs benchmark (only 20 queries), and inconsistencies in reported context lengths for Qwen3-8B (32K in Appendix C vs 128K in the official tech report). Additionally, cost comparisons rely on asymmetrical model choices (GPT-5-mini for sub-calls vs GPT-5 for baselines) that complicate fairness assessments.
The theoretical framing of RLMs as providing enhanced expressive power through symbolic recursion is well-articulated. The comparison between Algorithm 1 (RLM) and Algorithm 2 (naive scaffold) clearly isolates three critical design choices: symbolic prompt handles, output construction through variables, and programmatic recursion. The empirical results on BrowseComp-Plus are particularly striking: RLM(GPT-5) achieves 91.3% accuracy versus 0% for the base model and 70.5% for the summary agent. The cost analysis showing comparable median costs with higher variance is transparent about the trade-offs. The emergence of coherent chunking strategies (e.g., "RLM(Qwen3-Coder) chunks by newline in a 1000+ line context from OOLONG") demonstrates that the approach yields interpretable behaviors.
Four issues undermine confidence in the claims. First, the training data correction process for RLM-Qwen3-8B is underspecified: "16% of turns cleaned incorrectly used FINAL answers, and 13% of turns incorrectly called a variable from the REPL"—yet the exact correction logic is not provided. Second, the novel OOLONG-Pairs benchmark contains only 20 queries, making robust statistical claims difficult. Third, the context length for Qwen3-8B is reported as ~32K tokens in Appendix C but 128K in the official Qwen3 technical report (Table 1), creating confusion about baseline capabilities. Fourth, cost comparisons use GPT-5-mini for RLM sub-calls ($0.99) against full GPT-5 base model runs—an asymmetry that artificially inflates RLM cost-effectiveness when sub-calls could instead use the same base model.
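To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for a benchmark with 20 items. It rests on a simplifying assumption: each query is treated as a pass/fail trial, whereas OOLONG-Pairs is scored with F1, so the numbers are illustrative rather than a reanalysis of the paper's results. The count of 12 successes is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 12 of 20 queries answered correctly (60%).
low, high = wilson_interval(12, 20)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With only 20 items, an observed 60% is compatible with true rates from roughly 39% to 78%, so differences of a few points between systems on a benchmark of this size would not be distinguishable.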
The evidence supports the core claim that RLMs scale context processing, but several comparisons warrant scrutiny. The RLM(GPT-5) vs base model comparison on OOLONG-Pairs (58.0% vs 0.1% F1) is compelling, though the near-total failure of the baseline suggests the task may be artificially difficult for single-call models. The CodeAct baseline with BM25 performs surprisingly poorly on BrowseComp-Plus (51.0% vs 91.3% for RLM(GPT-5)), which raises questions about the baseline's implementation quality rather than establishing conceptual superiority. The 'no sub-calls' ablation shows that the REPL alone provides substantial benefits (66.0% vs 62.0% on CodeQA for Qwen3-Coder), suggesting that recursion may be less critical for some tasks. Related work comparisons are generally fair, though THREAD, ReDel, and Context Folding are mentioned as related approaches without direct empirical comparison.
Reproducibility has mixed prospects. Positives: (1) Code is available at github.com/alexzhang13/rlm; (2) The REPL system prompts are provided in full in Appendix C (including base64-encoded diffs); (3) BrowseComp-Plus is a public benchmark with verified documents. Blockers: (1) GPT-5 and GPT-5-mini are proprietary with limited reproducibility; (2) The 'programmatic correction step' for training data is not described with sufficient detail to replicate; (3) No dataset of the 1,000 training trajectories is released, and 'LongBenchPro' tasks used for training are not described with version or date information; (4) No hyperparameters for the fine-tuning (learning rate, learning rate schedule, warmup steps) are provided beyond 'batch size of 64 for 300 training steps'; (5) The random seeds for evaluation and exact prompt templates for sub-calls are not specified.
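One reason the missing fine-tuning details matter can be seen from a back-of-the-envelope calculation on the two numbers that are reported. The sketch assumes that each of the 1,000 samples is a single training example and that no gradient accumulation is used; neither assumption is confirmed by the paper as summarized here, and `approx_epochs` is a hypothetical helper.

```python
def approx_epochs(batch_size: int, steps: int, dataset_size: int,
                  grad_accum: int = 1) -> float:
    """Approximate number of passes over the dataset implied by a training budget."""
    examples_seen = batch_size * steps * grad_accum
    return examples_seen / dataset_size


# Reported: batch size 64, 300 steps; assumed: 1,000 examples, no accumulation.
epochs = approx_epochs(batch_size=64, steps=300, dataset_size=1000)
print(f"examples seen: {64 * 300}, approx. epochs: {epochs:.1f}")
```

Under these assumptions the model would see each sample about 19 times, a regime in which the learning rate and schedule strongly affect the outcome. If the samples are instead split into per-turn examples, the effective dataset is larger and the figure drops accordingly, which is exactly the ambiguity the missing details leave open.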
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.