Breaking the Context Window: How Recursive Language Models Handle Infinite Input
Long-context understanding has been a persistent challenge in language model research. Despite architectural innovations (ALiBi, YaRN, RoPE variants) and massive context window expansions (Claude 3.5 at 200k tokens, GPT-5 at 256k+), models still exhibit performance degradation on long inputs, a phenomenon known as "context rot." The community has pursued ever-larger context windows, but this approach hits both performance and systems-level walls.
Recursive Language Models (RLMs), a new inference strategy from MIT researchers, take a fundamentally different approach: instead of forcing models to process unbounded context directly, let them recursively decompose and interact with that context through a REPL environment. The result? RLM(GPT-5-mini) outperforms base GPT-5 by 114% on challenging long-context benchmarks while maintaining comparable costs, and scales to effectively unlimited input lengths (10M+ tokens tested).
The Problem: Context Rot is Real and Poorly Characterized
Here's the paradox: modern frontier models achieve 90%+ accuracy on synthetic benchmarks like RULER's needle-in-a-haystack tests, yet practitioners consistently report degradation in real-world scenarios: extended Claude Code sessions, long ChatGPT conversations, multi-document analysis. The benchmarks don't capture what matters.
Why? Most long-context benchmarks test retrieval of isolated facts from low-information-density "haystacks." Real tasks involve higher complexity: semantic aggregation across thousands of entries, multi-hop reasoning across documents, or maintaining coherence over evolving state. As Anthropic notes in its context engineering documentation, "as the number of tokens in the context window increases, the model's ability to accurately recall information decreases." But that framing misses the interaction between context length and task complexity.
The dominant response has been architectural: make context windows bigger and hope performance holds. But two fundamental problems remain:
- Performance degradation persists even within nominal context limits, especially for information-dense tasks
- Systems constraints make processing multi-million token contexts impractical despite theoretical support
What if the solution isn't bigger context windows, but smarter decomposition strategies that models control themselves?
How Recursive Language Models Work
Here's the elegant core idea: from the user's perspective, calling an RLM looks identical to calling a regular language model. Using rlm.completion(query, context) provides the same interface as a standard model completion. But under the hood, something fundamentally different is happening.

The Context-Centric Approach
Instead of giving the model your entire context upfront, RLMs provide:
- Your query to the root language model
- An environment (specifically, a Python REPL/notebook) that stores the context as a variable
- The ability to recursively spawn new LM calls within that environment
Think of it like this: instead of asking a model to read and memorize a 500-page book before answering a question, you give it the question and let it interactively explore the book, peeking at sections, searching for keywords, and asking sub-questions to itself about specific passages.
The REPL Environment
The choice of Python REPL as the environment is deliberate and powerful. The root LM receives only the query initially, with context stored as a variable it can manipulate programmatically:
- Inspection: context[:2000] or context.split('\n')[100:110]
- Filtering: [line for line in context if regex.match(pattern, line)]
- Transformation: Parse, chunk, or restructure data before querying
- Recursive decomposition: rlm_call(sub_query, context_chunk) spawns isolated sub-instances
- State accumulation: Build answers incrementally across iterations
Crucially, the root LM's context window stays lean. It never ingests the full context, only the portions it explicitly requests and the outputs from recursive calls (which are truncated to prevent context bloat).
When confident, the model terminates with FINAL(answer) or FINAL_VAR(variable_name) to return a programmatically constructed result.
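To make the mechanics concrete, here is a minimal sketch of what a driver loop in this style could look like. It is illustrative only, not the authors' implementation: llm_generate stands in for a hypothetical base-LM API call, real systems sandbox the executed code, and the FINAL parsing is simplified.

import io, re
from contextlib import redirect_stdout

def rlm_completion(query, context, llm_generate, max_steps=20, truncate=2000):
    # llm_generate(prompt) -> str is a hypothetical stand-in for a base-LM call.
    env = {"context": context,
           "rlm_call": lambda q, c: rlm_completion(q, c, llm_generate)}  # recursive sub-calls
    transcript = (f"Query: {query}\n"
                  f"The context is stored in the REPL variable `context` ({len(context)} chars). "
                  "Reply with Python code, or FINAL(answer) / FINAL_VAR(name) when done.")
    for _ in range(max_steps):
        action = llm_generate(transcript)                      # model emits code or a FINAL signal
        if m := re.match(r"FINAL\((.*)\)\s*$", action, re.S):
            return m.group(1)                                  # direct answer
        if m := re.match(r"FINAL_VAR\((\w+)\)\s*$", action):
            return env[m.group(1)]                             # programmatically built answer
        buf = io.StringIO()
        try:
            with redirect_stdout(buf):
                exec(action, env)                              # run the model's code against REPL state
            output = buf.getvalue()[:truncate]                 # truncate so the root context stays lean
        except Exception as e:
            output = f"Error: {e}"
        transcript += f"\n>>> {action}\n{output}"              # only code + truncated output reach the root LM
    return llm_generate(transcript + "\nGive your best final answer now.")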
The Results: Small Models Outperforming Large Ones
The researchers tested RLMs on two challenging benchmarks where modern frontier models struggle:
OOLONG: Beating GPT-5 with GPT-5-mini
The OOLONG benchmark tests semantic reasoning over thousands of unstructured data entries. A typical question reads: "Among instances from users 67144, 53321, 38876..., how many should be classified as 'entity'?" The entries aren't pre-labeled, so the model must infer labels for roughly 3,000-6,000 rows.
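In the REPL framing above, one plausible trajectory for a query like this looks roughly as follows. This is a hedged sketch, not a transcript from the paper: the tab-separated entry format is an assumption, and rlm_call stands in for the recursive sub-call interface.

# Sketch of an OOLONG-style trajectory (entry format and helpers are assumptions).
target_users = {"67144", "53321", "38876"}

# Cheap, non-LM step: filter the raw context down to rows from the relevant users.
rows = [line for line in context.split("\n") if line.split("\t")[0] in target_users]

# Semantic step: label each candidate row with a recursive sub-call, then count programmatically.
labels = [rlm_call(f"Classify this entry as 'entity' or 'other': {row}", row) for row in rows]
print(sum(label.strip().lower() == "entity" for label in labels))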
The results at 132k tokens (well within GPT-5's context window):
- GPT-5: 30.2 points
- GPT-5-mini: 20.3 points
- RLM(GPT-5-mini): 64.7 points (114% improvement over GPT-5!)
And get this: RLM(GPT-5-mini) cost roughly the same per query as regular GPT-5. A smaller model with recursive reasoning beats a larger model with direct access to all the context, and does so economically.
Even at 263k tokens (near the limit for both models), RLM(GPT-5-mini) maintained a 49% improvement over base GPT-5, while the cost per query actually decreased.

BrowseComp-Plus: Handling 10M+ Tokens
The researchers created a "Deep Research" task using BrowseComp-Plus, where models must answer multi-hop questions scattered across massive document collections (think 100,000 documents averaging 5k words each).
At 1,000 documents (~5M+ tokens):
- GPT-5 (truncated): ~40% accuracy
- GPT-5 + BM25 retrieval: ~60% accuracy
- ReAct + GPT-5 + BM25: ~80% accuracy
- RLM(GPT-5): 100% accuracy
What's remarkable is that RLM(GPT-5) was the only approach that maintained perfect performance at this scale, and did so without pre-indexing the corpus with traditional retrieval methods. The model dynamically decides how to search and process information at runtime.
What's the Model Actually Doing?
One of the coolest aspects of RLMs is interpretability: you can watch how the model breaks down its task. The researchers observed several emergent strategies:
Peeking: Like a programmer examining a dataset, the model inspects small portions to understand structure:
context[:2000] # Look at first 2000 characters
Grepping: Using pattern matching to narrow search space:
[line for line in context if 'user_id' in line and 'question' in line]
Partition + Map: Chunking context and spawning parallel recursive calls:
results = []
for chunk in split_context(context, size=1000):
    results.append(recursive_lm(f"Label this chunk: {chunk}"))
Summarization: Condensing information from subsets before making decisions.
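A sketch, reusing the split_context and recursive_lm helpers from above (both assumed to exist in the REPL):
# Condense each chunk before the root LM reasons over the (much smaller) set of summaries.
summaries = [recursive_lm(f"Summarize the key facts in: {chunk}")
             for chunk in split_context(context, size=2000)]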
Programmatic One-Shotting: For tasks like tracking git diffs across 75k+ token histories (where GPT-5 fails 90% of the time), RLMs can programmatically apply changes rather than trying to "think through" each diff.
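A toy illustration of the principle (not the actual git-diff task): each change is applied by code rather than tracked mentally, so correctness doesn't depend on the model holding every intermediate file state in its head. The (old, new) edit-pair format here is invented for the example.

# Toy sketch: apply a long sequence of edits programmatically instead of reasoning through them.
file_text = "timeout = 30\nretries = 1\n"
edits = [("timeout = 30", "timeout = 60"), ("retries = 1", "retries = 5")]
for old, new in edits:
    file_text = file_text.replace(old, new, 1)  # apply each change exactly once, in order
print(file_text)  # final state after all edits, with no manual bookkeeping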

Four Key Observations from the Paper
Beyond the impressive benchmark results, the researchers identified four critical patterns that reveal how and when RLMs excel:
Observation 1: RLMs Scale to 10M+ Tokens and Outperform Everything Else
RLMs demonstrated strong performance on inputs well beyond the effective context window of frontier LMs, outperforming base models and common long-context scaffolds by up to 2× while maintaining comparable or cheaper average token costs.
The kicker? On BrowseComp-Plus with 1,000 documents (6-11M input tokens), the cost of using RLM scales comparably to what it would theoretically cost for GPT-5-mini to ingest those tokens directly, except the RLM actually works while direct ingestion fails.
Even on tasks that fit well within the model's context window (like OOLONG at 132k tokens), RLMs make significant improvements over the base model. This suggests context rot happens even when you're technically "within limits."
Observation 2: REPL Enables Scale, Recursion Enables Intelligence
The ablation studies revealed something nuanced: the two components of RLMs serve different purposes.
The REPL environment alone (without recursive sub-calls) enables scaling beyond context limits. Even when the RLM can't spawn sub-queries, offloading context to a variable that the model manipulates programmatically still outperforms base models on most long-context tasks. On some tasks like CodeQA with Qwen3-Coder, the REPL-only ablation even outperformed the full RLM.
Recursive sub-calling becomes critical for information-dense tasks. On OOLONG and OOLONG-Pairs, where the model needs to perform semantic transformations (like labeling each entry), recursive calls were necessary. The ablation without sub-calls was forced to use keyword heuristics and performed significantly worse.
The takeaway: if you're just dealing with long contexts that need basic processing, REPL alone might suffice. But for tasks requiring deep semantic understanding across the context, recursion is essential.
Observation 3: Degradation vs. Complexity
Perhaps the most interesting finding: LM performance degrades as a function of both input length AND problem complexity, while RLM performance degrades much more slowly.
The researchers tested three benchmark types with different processing costs:
- S-NIAH (constant complexity): Simple needle-in-haystack
- OOLONG (linear complexity): Transform and aggregate many entries
- OOLONG-Pairs (quadratic complexity): Pairwise aggregation across inputs
As contexts grew from small to 2^18 tokens, GPT-5's performance degraded significantly faster on more complex tasks. RLMs also degraded, but at a much slower rate. This aligns with recent findings about how complexity compounds with length in language models.
Interestingly, on small input contexts, base LMs slightly outperform RLMs, which points to a crossover: the RLM framework adds overhead that only pays off once inputs exceed a certain threshold. For short, simple queries, just use the base model.
Observation 4: Cost is Comparable but High Variance
RLM inference costs remain in the same order of magnitude as base model calls, but with much higher variance due to differences in trajectory lengths.
At the 50th percentile, RLMs are often cheaper than base models. But at the 95th percentile, costs can spike sharply when the RLM takes particularly long trajectories to solve complex problems.
This makes sense: when the model finds the answer quickly (efficient grepping or early termination), it's cheap. When it needs to recursively process many chunks or iterate multiple times, costs accumulate. The median query is efficient; the tail is expensive.
This variance is actually a feature, not a bug, since it means that the model is adaptively spending compute where needed rather than using a fixed budget regardless of difficulty.
Why This Matters Beyond Long Context
RLMs represent a fundamentally different approach to test-time compute scaling. While reasoning models like o1 scale by thinking longer, and agent models scale by taking more actions, RLMs scale by recursively decomposing context.
Not Just Agents, Not Just Summarization
Existing agentic systems decompose tasks based on human intuition: "first search the web, then analyze results, then synthesize." RLMs decompose context and let the model decide the strategy. It's the difference between giving someone step-by-step instructions versus trusting them to figure out the best approach.
The Training Data Opportunity
Just as Chain-of-Thought reasoning and instruction-tuning improved models by presenting data in predictable formats, RLMs create a new axis for scaling laws. Imagine training models explicitly to:
- Efficiently partition context
- Choose optimal recursion strategies
- Balance cost vs. performance in their recursion depth
- Handle truly unbounded inputs
These trajectories are entirely learnable. They can be optimized with reinforcement learning the same way reasoning is trained in current frontier models.
RLMs Improve As Models Improve
Here's the multiplier effect: if tomorrow's frontier LM handles 10M tokens well, an RLM built on it could handle 100M+ tokens at potentially lower cost. Every improvement to base model capabilities compounds when applied recursively.
Current Limitations and Future Work
The researchers are transparent about limitations:
- No optimization for speed: Each recursive call is blocking (not async) and doesn't use prefix caching. Queries can take minutes.
- No guarantees on cost or runtime: The model decides how deeply to recurse, which affects both.
- Depth limited to 1: Current experiments only allow the root LM to call other LMs, not other RLMs. Enabling deeper recursion could unlock more complex reasoning.
But these are implementation challenges, not fundamental limitations. The systems community has significant low-hanging fruit to optimize here: asynchronous execution, better caching strategies, and batching recursive calls could dramatically improve performance.
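As a rough sketch of the batching idea (assuming an async_llm_call coroutine that wraps whatever LM API the RLM uses; nothing like this exists in the current implementation):

import asyncio

# Fire all sub-calls concurrently and await them together, instead of blocking on each in turn.
async def map_chunks(chunks, async_llm_call):
    tasks = [async_llm_call(f"Label this chunk: {chunk}") for chunk in chunks]
    return await asyncio.gather(*tasks)

# usage: results = asyncio.run(map_chunks(chunks, async_llm_call))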
The Philosophical Shift
RLMs force us to reconsider a core assumption: must we solve context rot at the model architecture level? The researchers argue no. Instead of training models that never degrade with context length, build a framework where models never face that context length directly.
This is a different bet than modern agents. Agents encode human intuition about problem decomposition. RLMs encode the principle that language models should decide how to decompose problems for language models.
We don't know yet which approach will win. But given that RLM(GPT-5-mini) already beats GPT-5 on challenging benchmarks without any specialized training, the early results are compelling.
Final Thoughts
Recursive Language Models solve a problem that's been hiding in plain sight. We've been so focused on expanding context windows that we forgot to ask: do we need to show models everything at once?
The answer, increasingly, seems to be no. By letting models recursively explore and process context through a programmable environment, we can:
- Handle effectively unlimited input lengths
- Improve performance even for contexts well within existing limits
- Reduce costs compared to processing everything directly
- Maintain interpretability of model reasoning
As models get better at this kind of recursive reasoning, especially with explicit training, RLMs could represent the next major milestone in test-time inference scaling, right alongside chain-of-thought reasoning and agentic workflows.
The future of long context might not be bigger windows. It might be smarter recursion.
Paper: "Recursive Language Models" by Alex L. Zhang and Omar Khattab (MIT)
Available at: arxiv.org/abs/2512.24601