Model architecture gets all the attention. Post-training recipes follow close behind. The training environment — what the model actually practices on, how its work gets judged, what tools it can use — barely enters the conversation. That’s the part that actually determines what the agent can learn to do.

A model trained only on single-turn Q&A will struggle the moment you ask it to maintain state across a 50-step enterprise workflow. A model trained with a poorly designed reward function will learn to game the metric, not solve the problem. The environment isn’t a detail. It’s half the system.

The Canonical Loop

An RL environment for an LLM agent bundles three things: a dataset of task inputs, a harness for the model, and a reward function to score outputs. The training loop looks like this:

┌─────────────┐    prompt      ┌──────────┐   action    ┌───────┐
│    Task     │───────────────>│  Agent   │────────────>│ Tools │
│   Dataset   │                │ Harness  │<────────────│  Env  │
└─────────────┘                │ (Model)  │ observation  └───────┘
                               └────┬─────┘
                                    │ completion
                                    v
                             ┌──────────────┐
                             │  Verifier /  │──> reward ──> Trainer
                             │   Rubric     │
                             └──────────────┘

Formally, a complete RL environment is a tuple:

\[E = (T, H, V, S, C)\]

Task distribution, harness, verifier, state management, configuration. Let’s go through each.
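In code, the loop is small. A minimal sketch, with stub callables standing in for the task distribution, harness, verifier, and trainer (every name here is illustrative, not a real API):

```python
import random

def train_step(tasks, rollout, verify, update):
    """One pass of the canonical loop: sample a task, let the agent
    act through the harness, score the completion, hand the reward
    to the trainer. All four callables are placeholders."""
    task = random.choice(tasks)          # T: task distribution
    completion = rollout(task)           # H: harness drives the model
    reward = verify(task, completion)    # V: verifier maps to [0, 1]
    update(task, completion, reward)     # trainer consumes the reward
    return reward

# Toy instantiation: a stub "model" and an exact-match verifier.
tasks = [{"prompt": "2 + 2 = ?", "answer": "4"}]
history = []
r = train_step(
    tasks,
    rollout=lambda t: "4",
    verify=lambda t, c: 1.0 if c == t["answer"] else 0.0,
    update=lambda t, c, r: history.append(r),
)
```

Everything in the rest of this post is about what goes inside those four callables.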

T: The Task Distribution

The task distribution is the set of problems the agent trains on. Not all tasks are equal, and not just in difficulty. They vary structurally in ways that demand different capabilities:

| Task Type | What the Agent Must Do | Example Systems |
|---|---|---|
| Single-turn Q&A | One prompt → one response, check answer | Math benchmarks, SimpleQA |
| Multi-hop search | Chain searches, synthesize sources | BrowseComp, WebWalkerQA |
| Open-ended research | No single correct answer; report quality matters | ADR-Bench, ResearchRubrics |
| Agentic tool-use | Call tools correctly in sequence | tau-bench, function-calling benchmarks |
| Stateful enterprise | Modify persistent DB state, work within access controls | EnterpriseOps-Gym |
| Code execution | Write code, run it, check outputs | SWE-Bench, LiveCodeBench |

Training only on tasks with ground-truth answers produces an agent that’s never learned to reason under ambiguity. Training only in clean, deterministic environments produces an agent that falls apart in production. The task distribution is a design decision with direct consequences.

Task synthesis is increasingly a first-class problem. For real-world research tasks, you rarely have a large labeled dataset. Several strategies have emerged:

  • Back translation: Start from a desired output, reconstruct the query that would produce it
  • Graph-based synthesis: Build a knowledge graph, generate multi-hop queries over it
  • Automated environment generation: Use LLM coding agents to write new environment code. AutoEnv reports ~$4/env average cost.
  • Curriculum construction: Order tasks by difficulty and increase complexity during training

The cheapest-to-collect tasks are single-turn with verifiable answers. The most valuable tasks for long-horizon behavior are expensive to construct. This tension drives most environment design decisions.
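Of the synthesis strategies above, curriculum construction is the easiest to make concrete. A toy sketch, assuming you have some per-task difficulty proxy (turn count, reference-model solve rate, whatever you can measure):

```python
def build_curriculum(tasks, difficulty, num_stages=3):
    """Curriculum construction sketch: sort tasks by a difficulty
    score and release them in progressively harder stages, so early
    training sees only the easiest slice of the distribution."""
    ordered = sorted(tasks, key=difficulty)
    stage_size = max(1, len(ordered) // num_stages)
    stages = [ordered[: stage_size * (i + 1)] for i in range(num_stages)]
    stages[-1] = ordered  # final stage sees the full distribution
    return stages

# Difficulty proxy here: episode length in turns.
tasks = [{"id": i, "turns": t} for i, t in enumerate([3, 50, 8, 200, 1])]
stages = build_curriculum(tasks, difficulty=lambda t: t["turns"])
# stages[0] holds only the easiest tasks; stages[-1] holds all five
```

Real curricula (AgentScaler's two-phase schedule, for instance) gate stages on capability rather than a fixed task count, but the shape is the same.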

H: The Agent Harness

The harness is the scaffolding that mediates between the model and the environment. It controls how the model interacts, not what it knows.

from dataclasses import dataclass, field

@dataclass
class Harness:
    rollout_protocol: str       # "single_turn" | "multi_turn" | "agentic"
    tools: list                 # available tools in this episode
    system_prompt: str          # instructions for the agent
    context_manager: object     # how to handle context overflow
    turn_limit: int             # max interactions per episode
    sandbox: object = None      # code execution sandbox, if any
    state: dict = field(default_factory=dict)  # persistent state across turns

Rollout protocols range from trivial to complex:

| Harness Type | Description | When to Use |
|---|---|---|
| Single-Turn | One prompt, one response | Math, factual QA |
| Multi-Turn | Back-and-forth dialogue | Games, structured tasks |
| Tool-Use | Model calls tools, receives results | Agent benchmarks |
| Stateful Tool-Use | Tools modify persistent state | Enterprise workflows, SWE-Bench |
| Agentic ReAct | Full Thought→Action→Observation loop | Deep research, complex workflows |

Tools span a wide taxonomy:

| Category | Tools | Deterministic? | Stateful? |
|---|---|---|---|
| Information retrieval | web_search, scholar_search | No (live web) | No |
| Content extraction | jina_reader, visit, web_scrape | No | No |
| Code execution | python_interpreter, shell, sandbox | Yes (given same code) | Yes |
| File operations | file_read, file_write | Yes | Yes |
| Browser automation | playwright, link_click | No | Yes |
| Task management | todo, section_write | Yes | Yes |

The mix of deterministic/non-deterministic and stateful/stateless tools has real implications for reproducibility and reward assignment. Non-deterministic tools mean two runs of the same trajectory can produce different outcomes — which complicates both debugging and verifier design.

Context management is where most teams underinvest, especially for long-horizon tasks. A 600-turn research episode blows past any practical context window. Strategies used in production:

| Strategy | Description | Trade-off |
|---|---|---|
| Recency-based retention | Keep N most recent turns | Simple, but loses early context |
| Markovian reconstruction | Reconstruct state from scratch each turn | Principled, expensive |
| Reference-preserving summarization | Summarize old context, keep citations | Preserves verifiability |
| Reference-preserving folding | Compress context without losing references | Best for research tasks |

An agent doing multi-hour research needs to remember why it started searching in a particular direction twelve tool calls ago. Dropping that context causes repeated work and lost threads.
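The simplest of these strategies, recency-based retention, fits in a few lines. A minimal sketch over OpenAI-style message dicts (the message schema is an assumption, not part of any of the systems above):

```python
def retain_recent(messages, max_turns=20):
    """Recency-based retention sketch: keep the system prompt plus
    the N most recent turns, dropping everything in between. Simple,
    but it exhibits exactly the failure described above: the agent
    forgets why it started searching in a given direction."""
    if messages and messages[0]["role"] == "system":
        head, rest = messages[:1], messages[1:]
    else:
        head, rest = [], messages
    return head + rest[-max_turns:]

history = [{"role": "system", "content": "You are a research agent."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(100)]
trimmed = retain_recent(history, max_turns=20)
# trimmed: system prompt + the 20 most recent turns; turns 0-79 are gone
```

Reference-preserving variants replace the dropped middle with a summary that keeps citations intact, rather than discarding it outright.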

V: The Verifier

The verifier maps a completion to a reward:

\[V: (\text{prompt}, \text{completion}, \text{info}) \rightarrow [0, 1]\]

In Atari, the score is unambiguous. In deep research, what counts as a good answer is not. The verifier is where this gets hard.

| Type | Reward Signal | When to Use |
|---|---|---|
| Exact match | Binary (0/1) | Ground truth available |
| Code execution | Binary or partial | Output can be tested programmatically |
| LLM-as-judge | Continuous [0, 1] | Open-ended quality, no other option |
| Checklist-style | Continuous | Multi-criteria research tasks |
| Evolving rubric (RLER) | Continuous | When reward hacking is a concern |
| Process reward model (PRM) | Per-step continuous | Long-horizon credit assignment |
| Pairwise comparison | Relative rank | Relative quality matters more than absolute |
| Multi-criteria composite | Weighted sum | Multiple quality dimensions |

A few principles that actually matter in practice:

Verifiable beats judgeable. Programmatic checks — code execution, string match — are faster, cheaper, and more consistent than LLM-as-judge. Use LLM-as-judge when there’s no other option, not as the default.
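A programmatic verifier matching the signature above can be a few lines. A sketch, where the `Answer:` extraction convention and the `info["answer"]` key are assumptions about your task schema, not a standard:

```python
import re

def exact_match_verifier(prompt, completion, info):
    """Exact-match verifier sketch: pull the final answer out of the
    completion, normalize it, and compare against ground truth.
    Maps (prompt, completion, info) -> reward in [0, 1]."""
    match = re.search(r"Answer:\s*(.+)", completion)
    predicted = (match.group(1) if match else completion).strip().lower()
    return 1.0 if predicted == str(info["answer"]).strip().lower() else 0.0

score = exact_match_verifier("What is 6 x 7?", "Answer: 42", {"answer": "42"})
# score == 1.0; no model call, no judge, deterministic every time
```

Note how much of the work is answer normalization. In practice that normalization layer (stripping units, canonicalizing numbers) is where exact-match verifiers quietly go wrong.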

Reward granularity is a separate decision from reward type. You can score at the trajectory level (did the final output pass?), turn level (was each tool invocation useful?), or per-step with process rewards. Turn-level supervision, as Nanbeige4.1 does across up to 600 tool calls, enables finer credit assignment — the model can learn that the problem was a bad search query in turn 23, not that the entire episode failed.

Static rubrics get gamed. Models learn to write answers that score well on your rubric rather than solving the problem. DR Tulu’s RLER (Reinforcement Learning with Evolving Rubrics) co-evolves the rubric with the policy during training. Harder to exploit a moving target.

Noise injection is underrated. Step-DeepResearch deliberately injects 5–10% tool errors during training. The resulting model handles flaky APIs and unexpected failures in production significantly better.
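The mechanism is just a wrapper around the tool layer. A minimal sketch; the wrapper and error format are ours, only the 5–10% rate comes from the text:

```python
import random

def with_tool_noise(tool_fn, error_rate=0.07, rng=random):
    """Noise-injection sketch: wrap a tool so a fraction of calls
    return an error payload instead of a result, forcing the agent
    to learn retries and fallbacks during training."""
    def noisy(*args, **kwargs):
        if rng.random() < error_rate:
            return {"error": "ToolError: transient failure, try again"}
        return tool_fn(*args, **kwargs)
    return noisy

# Wrap a stub search tool with a 10% injected failure rate.
search = with_tool_noise(lambda q: {"results": [q]}, error_rate=0.10)
```

The agent never knows whether a failure was injected or real, which is the point: the recovery behavior it learns transfers directly to flaky production APIs.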

S & C: State and Configuration

State management determines whether the environment is stateless or persistent. Most academic benchmarks are stateless — each episode starts fresh. Enterprise environments are not. EnterpriseOps-Gym maintains 164 database tables and 512 tools across episodes. Actions in one task affect the state seen by subsequent tasks. That’s a fundamentally different problem for agents to solve.

Configuration covers turn limits, context budgets, sampling temperature, and curriculum scheduling. These are not afterthoughts. A turn limit of 5 vs. 600 changes what skills the agent can develop. AgentScaler uses a two-phase curriculum — fundamental capabilities first, then domain-specific tasks — and the ordering matters. Step-DeepResearch progressively scales context windows from 32K to 128K during mid-training.

Where Does the Model Live?

The most consequential architectural question: where does the model sit relative to the environment?

Option A: Model outside the environment (decoupled). Model is served via API; the environment calls it at each step. Clean separation. Easy to swap models.

Option B: Model inside the environment (co-located). Model and environment share the same training loop. Lower latency, tighter integration, harder to reuse.

Option C: Split architecture. Trainer, model inference server, and environment are three separate processes communicating via API. This is where the field is landing.

How real systems implement this:

| System | Topology | Notes |
|---|---|---|
| Prime Intellect verifiers | Option C (Split) | Env is a standalone Python package distributed via Environments Hub |
| Tongyi DeepResearch | Option B (Co-located) | Tools, context manager, verifiers inside the training pipeline |
| Step-DeepResearch | Option B (Co-located) | Single-agent ReAct loop embedded in training |
| MiroThinker | Option C (Split) | Tool servers and sandbox run independently from model |
| Tinker API | Option A (Decoupled, cloud) | Model stays remote; researcher sends forward_backward + sample calls via API |
| AutoEnv | Option A (Decoupled) | CoreEnv/ObsEnv/SkinEnv abstraction layers |
| EnterpriseOps-Gym | Option A (Decoupled) | Containerized sandbox accessible via any model API |

The trade-offs:

| Dimension | Model Outside (A/C) | Model Inside (B) |
|---|---|---|
| Flexibility | Swap models easily; env is reusable | Tightly coupled; harder to swap models |
| Scalability | Scale inference and training independently | Must scale everything together |
| Portability | Env packages are shareable | Env tied to training framework |
| Latency | Network overhead per tool call | No network overhead |
| RL compatibility | Works with any RL trainer | Usually tied to one trainer |

The field has converged on Option C for production training. Prime Intellect’s architecture — environments as standalone installable packages that communicate with models via OpenAI-compatible API endpoints — is becoming the standard. The payoff: environments are publishable, trainer-agnostic, and inference and environment execution can be parallelized across nodes.

Tinker pushes this further by making even the training compute remote. The researcher controls the algorithm and never touches model weights. The environment’s job is purely generating experience.
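The decoupled shape is worth seeing concretely. A sketch of an Option C rollout loop, where `chat_fn` stands in for an OpenAI-compatible HTTP client (the message and tool-call schema here is a simplification, not any specific system's protocol):

```python
def decoupled_rollout(task, chat_fn, tools, max_turns=10):
    """Option C sketch: the environment owns the loop and talks to
    the model only through `chat_fn` (messages -> assistant message).
    Swapping models means swapping the endpoint behind chat_fn; the
    environment code never changes."""
    messages = [{"role": "user", "content": task["prompt"]}]
    for _ in range(max_turns):
        reply = chat_fn(messages)
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:  # model produced a final answer
            return messages
        result = tools[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": str(result)})
    return messages

# Stub model: issue one search, then answer with whatever came back.
def stub_model(messages):
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": messages[-1]["content"]}
    return {"role": "assistant", "content": "",
            "tool_call": {"name": "search", "args": {"q": "env design"}}}

trace = decoupled_rollout({"prompt": "research env design"},
                          stub_model, tools={"search": lambda q: f"notes on {q}"})
```

Because the model only ever appears behind `chat_fn`, the same environment runs against a local vLLM server, a hosted API, or a training-time inference worker.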

Practical Decision Framework

When building an RL environment for an LLM agent:

Do you have ground truth answers?

  • Yes → Exact-match or code-execution verifier
  • No → LLM-as-judge, checklist, or pairwise comparison

How many tool calls per episode?

  • < 5 → Single-turn or simple tool-use env, no context management needed
  • 5–50 → Multi-turn with basic context management
  • 50–600 → Full agentic env with reference-preserving context management

Where should the model live?

  • Experimenting across many environments → Model outside (Option A/C), use Prime Intellect Hub
  • Tight RL training loop → Model co-located (Option B)
  • No GPU access → Tinker API

What reward granularity?

  • Simple tasks → Outcome-level
  • Long-horizon tasks → Turn-level (Nanbeige4.1) or process rewards (PRIME)
  • Open-ended research → Evolving rubrics (RLER) or checklist-style (Step-DR)

How to scale environments?

  • Manual curation → High quality but expensive; this is where you start
  • AutoEnv → Automated generation at ~$4/env
  • AgentScaler → Systematic scaling of heterogeneous simulated environments
  • Prime Intellect Hub → 500+ community-contributed environments available now

What’s Coming

Three things to pay attention to.

Environment diversity matters as much as environment quality. AgentScaler’s key finding is that heterogeneity of environments drives capability breadth in ways that simply adding more data from the same distribution cannot. You need more kinds of environments, not just more environments.

Automated environment generation is viable. At $4 per generated environment, cost is no longer the bottleneck. The bottleneck is verifier quality — auto-generated environments with weak reward functions will teach the wrong behaviors at scale. (AutoEnv)

The environment-as-package model is winning. The Prime Intellect Environments Hub is creating a shared ecosystem around RL environments, in the same way PyPI and HuggingFace created ecosystems around code and model weights. Environments published once, consumed by any trainer. This is a significant infrastructure shift.

The model isn’t the only variable. The training ground shapes what the model can become. The task distribution, the harness, the verifier, the state management, the topology — these are the decisions that separate agents that work from agents that demo.

References

@misc{leehanchung,
    author = {Lee, Hanchung},
    title = {The Training Grounds: A Taxonomy of RL Environments for LLM Agents},
    year = {2026},
    month = mar,
    howpublished = {\url{https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/}}
}