The Training Grounds: A Taxonomy of RL Environments for LLM Agents
Model architecture gets all the attention. Post-training recipes follow close behind. The training environment — what the model actually practices on, how its work gets judged, what tools it can use — barely enters the conversation. That’s the part that actually determines what the agent can learn to do.
A model trained only on single-turn Q&A will struggle the moment you ask it to maintain state across a 50-step enterprise workflow. A model trained with a poorly designed reward function will learn to game the metric, not solve the problem. The environment isn’t a detail. It’s half the system.
The Canonical Loop
An RL environment for an LLM agent bundles three things: a dataset of task inputs, a harness for the model, and a reward function to score outputs. The training loop looks like this:
```
┌─────────────┐     prompt     ┌──────────┐    action   ┌───────┐
│    Task     │───────────────>│  Agent   │────────────>│ Tools │
│   Dataset   │                │ Harness  │<────────────│  Env  │
└─────────────┘                │ (Model)  │ observation └───────┘
                               └────┬─────┘
                                    │ completion
                                    v
                            ┌──────────────┐
                            │  Verifier /  │──> reward ──> Trainer
                            │    Rubric    │
                            └──────────────┘
```
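In code, the loop above comes down to a few lines. This is a minimal sketch, not any particular framework's API; `agent`, `tools`, and `verifier` are hypothetical callables standing in for the harness's model interface, tool registry, and reward function.

```python
def rollout(task, agent, tools, verifier, turn_limit=50):
    """Run one episode of the canonical loop and return (trajectory, reward)."""
    messages = [{"role": "user", "content": task["prompt"]}]
    trajectory = []
    for _ in range(turn_limit):
        action = agent(messages)              # model proposes an action
        trajectory.append(action)
        if action["type"] == "final_answer":  # episode ends on a completion
            break
        # environment executes the tool and feeds the observation back
        observation = tools[action["tool"]](action["args"])
        messages.append({"role": "assistant", "content": str(action)})
        messages.append({"role": "tool", "content": str(observation)})
    reward = verifier(task["prompt"], trajectory[-1], task.get("info", {}))
    return trajectory, reward
```

Everything that follows in this post is a design decision about one of the arguments to this function.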
Formally, a complete RL environment is a tuple:
\[E = (T, H, V, S, C)\]

Task distribution, harness, verifier, state management, configuration. Let’s go through each.
T: The Task Distribution
The task distribution is the set of problems the agent trains on. Not all tasks are equal, and not just in difficulty. They vary structurally in ways that demand different capabilities:
| Task Type | What the Agent Must Do | Example Systems |
|---|---|---|
| Single-turn Q&A | One prompt → one response, check answer | Math benchmarks, SimpleQA |
| Multi-hop search | Chain searches, synthesize sources | BrowseComp, WebWalkerQA |
| Open-ended research | No single correct answer; report quality matters | ADR-Bench, ResearchRubrics |
| Agentic tool-use | Call tools correctly in sequence | tau-bench, function-calling benchmarks |
| Stateful enterprise | Modify persistent DB state, work within access controls | EnterpriseOps-Gym |
| Code execution | Write code, run it, check outputs | SWE-Bench, LiveCodeBench |
Training only on tasks with ground-truth answers produces an agent that’s never learned to reason under ambiguity. Training only in clean, deterministic environments produces an agent that falls apart in production. The task distribution is a design decision with direct consequences.
Task synthesis is increasingly a first-class problem. With real-world research tasks, you rarely have a large labeled dataset. Strategies that have emerged:
- Back translation: Start from a desired output, reconstruct the query that would produce it
- Graph-based synthesis: Build a knowledge graph, generate multi-hop queries over it
- Automated environment generation: Use LLM coding agents to write new environment code. AutoEnv reports ~$4/env average cost.
- Curriculum construction: Order tasks by difficulty and increase complexity during training
The cheapest-to-collect tasks are single-turn with verifiable answers. The most valuable tasks for long-horizon behavior are expensive to construct. This tension drives most environment design decisions.
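Of the synthesis strategies above, curriculum construction is the easiest to sketch. This is a generic sketch, not AgentScaler's actual pipeline; the difficulty proxy (`ref_tool_calls`, the number of tool calls a reference solution needs) is an assumption, and any monotone proxy would do.

```python
from typing import Iterator

def curriculum(tasks: list[dict], stages: int = 3) -> Iterator[list[dict]]:
    """Order tasks by a difficulty proxy and yield them in stages of
    increasing complexity, so training sees easy tasks before hard ones."""
    ordered = sorted(tasks, key=lambda t: t["ref_tool_calls"])
    stage_size = max(1, len(ordered) // stages)
    for i in range(0, len(ordered), stage_size):
        yield ordered[i:i + stage_size]
```

The hard part in practice isn't the ordering, it's getting a difficulty proxy that correlates with what the model actually finds hard.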
H: The Agent Harness
The harness is the scaffolding that mediates between the model and the environment. It controls how the model interacts, not what it knows.
```python
H = {
    "rollout_protocol": "agentic",  # "single_turn" | "multi_turn" | "agentic"
    "tools": [],                    # tools available in this episode
    "system_prompt": "",            # instructions for the agent
    "context_manager": None,        # how to handle context overflow
    "turn_limit": 600,              # max interactions per episode
    "sandbox": None,                # code execution sandbox
    "state": {},                    # persistent state across turns
}
```
Rollout protocols range from trivial to complex:
| Harness Type | Description | When to Use |
|---|---|---|
| Single-Turn | One prompt, one response | Math, factual QA |
| Multi-Turn | Back-and-forth dialogue | Games, structured tasks |
| Tool-Use | Model calls tools, receives results | Agent benchmarks |
| Stateful Tool-Use | Tools modify persistent state | Enterprise workflows, SWE-Bench |
| Agentic ReAct | Full Thought→Action→Observation loop | Deep research, complex workflows |
Tools span a wide taxonomy:
| Category | Tools | Deterministic? | Stateful? |
|---|---|---|---|
| Information retrieval | web_search, scholar_search | No (live web) | No |
| Content extraction | jina_reader, visit, web_scrape | No | No |
| Code execution | python_interpreter, shell, sandbox | Yes (given same code) | Yes |
| File operations | file_read, file_write | Yes | Yes |
| Browser automation | playwright, link_click | No | Yes |
| Task management | todo, section_write | Yes | Yes |
The mix of deterministic/non-deterministic and stateful/stateless tools has real implications for reproducibility and reward assignment. Non-deterministic tools mean two runs of the same trajectory can produce different outcomes — which complicates both debugging and verifier design.
Context management is where most teams underinvest, especially for long-horizon tasks. A 600-turn research episode blows past any practical context window. Strategies used in production:
| Strategy | Description | Trade-off |
|---|---|---|
| Recency-based retention | Keep N most recent turns | Simple, but loses early context |
| Markovian reconstruction | Reconstruct state from scratch each turn | Principled, expensive |
| Reference-preserving summarization | Summarize old context, keep citations | Preserves verifiability |
| Reference-preserving folding | Compress context without losing references | Best for research tasks |
An agent doing multi-hour research needs to remember why it started searching in a particular direction twelve tool calls ago. Dropping that context causes repeated work and lost threads.
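Reference-preserving folding can be sketched in a few lines. This is an illustrative sketch, not any production system's implementation; `summarize` would normally be an LLM call, and the citation pattern (bracketed numbers and URLs) is an assumption about how references appear in the transcript.

```python
import re

def fold_context(turns: list[str], keep_recent: int, summarize) -> list[str]:
    """Compress old turns into a single summary turn while preserving
    citation markers like [3] or bare URLs, so later claims built on the
    folded context stay verifiable."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # collect every citation marker from the turns being folded away
    refs = sorted({m for t in old for m in re.findall(r"\[\d+\]|https?://\S+", t)})
    folded = summarize(old) + " | refs: " + " ".join(refs)
    return [folded] + recent
```

Recency-based retention is this function with the `refs` line deleted, which is exactly why it loses threads.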
V: The Verifier
The verifier maps a completion to a reward:
\[V: (\text{prompt}, \text{completion}, \text{info}) \rightarrow [0, 1]\]

In Atari, the score is unambiguous. In deep research, what counts as a good answer is not. The verifier is where this gets hard.
| Type | Reward Signal | When to Use |
|---|---|---|
| Exact match | Binary (0/1) | Ground truth available |
| Code execution | Binary or partial | Output can be tested programmatically |
| LLM-as-judge | Continuous [0,1] | Open-ended quality, no other option |
| Checklist-style | Continuous | Multi-criteria research tasks |
| Evolving rubric (RLER) | Continuous | Resistant to reward hacking |
| Process reward model (PRM) | Per-step continuous | Long-horizon credit assignment |
| Pairwise comparison | Relative rank | Relative quality matters more than absolute |
| Multi-criteria composite | Weighted sum | Multiple quality dimensions |
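The multi-criteria composite in the last row is the simplest to write down. A minimal sketch; the criteria names and weights are made up for illustration, and each per-criterion score is assumed to already be in [0, 1].

```python
def composite_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, normalized back into [0, 1]
    so the trainer sees a consistent reward scale regardless of weights."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total
```

The design question a composite doesn't answer for you is where the per-criterion scores come from; each criterion is itself one of the verifier types in the table above.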
A few principles that actually matter in practice:
Verifiable beats judgeable. Programmatic checks — code execution, string match — are faster, cheaper, and more consistent than LLM-as-judge. Use LLM-as-judge when there’s no other option, not as the default.
Reward granularity is a separate decision from reward type. You can score at the trajectory level (did the final output pass?), turn level (was each tool invocation useful?), or per-step with process rewards. Turn-level supervision, as Nanbeige4.1 does across up to 600 tool calls, enables finer credit assignment — the model can learn that the problem was a bad search query in turn 23, not that the entire episode failed.
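One generic way to operationalize turn-level credit: blend each turn's local usefulness score with the discounted final outcome. This is a shaping sketch under stated assumptions, not Nanbeige4.1's actual scheme; the 50/50 blend and the discount are arbitrary illustrative choices.

```python
def turn_level_rewards(turn_scores: list[float], outcome: float,
                       gamma: float = 0.99) -> list[float]:
    """Per-turn reward = half the turn's own usefulness score, half the
    final outcome discounted by distance from the end of the episode.
    A bad search query at turn 23 gets penalized locally instead of the
    whole episode being scored as one uniform failure."""
    n = len(turn_scores)
    return [0.5 * s + 0.5 * outcome * (gamma ** (n - 1 - t))
            for t, s in enumerate(turn_scores)]
```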
Static rubrics get gamed. Models learn to write answers that score well on your rubric rather than solving the problem. DR Tulu’s RLER (Rubric-Level Evolving Reward) co-evolves the rubric with the policy during training. Harder to exploit a moving target.
Noise injection is underrated. Step-DeepResearch deliberately injects 5–10% tool errors during training. The resulting model handles flaky APIs and unexpected failures in production significantly better.
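Noise injection is easy to retrofit onto an existing tool registry. A sketch of the idea, not Step-DeepResearch's implementation; the error payload format and 8% default rate are assumptions.

```python
import random

def with_noise(tool, error_rate=0.08, rng=None):
    """Wrap a tool so it fails on a random ~5-10% of calls during training,
    forcing the agent to learn retry and recovery behavior."""
    rng = rng or random.Random()
    def noisy(*args, **kwargs):
        if rng.random() < error_rate:
            return {"error": "ToolTimeout: simulated transient failure"}
        return tool(*args, **kwargs)
    return noisy
```

Because the wrapper sits between harness and tool, the same environment code runs clean at evaluation time by setting `error_rate=0`.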
S & C: State and Configuration
State management determines whether the environment is stateless or persistent. Most academic benchmarks are stateless — each episode starts fresh. Enterprise environments are not. EnterpriseOps-Gym maintains 164 database tables and 512 tools across episodes. Actions in one task affect the state seen by subsequent tasks. That’s a fundamentally different problem for agents to solve.
Configuration covers turn limits, context budgets, sampling temperature, and curriculum scheduling. These are not afterthoughts. A turn limit of 5 vs. 600 changes what skills the agent can develop. AgentScaler uses a two-phase curriculum — fundamental capabilities first, then domain-specific tasks — and the ordering matters. Step-DeepResearch progressively scales context windows from 32K to 128K during mid-training.
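A progressive context schedule is one line of arithmetic. This sketch uses a linear ramp from 32K to 128K; Step-DeepResearch's actual schedule shape is not specified here, so the interpolation is an assumption.

```python
def context_schedule(step: int, total_steps: int,
                     start: int = 32_768, end: int = 131_072) -> int:
    """Linearly grow the context budget over training, clamped at `end`
    once training passes `total_steps`."""
    frac = min(1.0, step / max(1, total_steps))
    return int(start + frac * (end - start))
```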
Where Does the Model Live?
The most consequential architectural question: where does the model sit relative to the environment?
Option A: Model outside the environment (decoupled). Model is served via API; the environment calls it at each step. Clean separation. Easy to swap models.
Option B: Model inside the environment (co-located). Model and environment share the same training loop. Lower latency, tighter integration, harder to reuse.
Option C: Split architecture. Trainer, model inference server, and environment are three separate processes communicating via API. This is where the field is landing.
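Option C in miniature: the environment owns the loop and treats the model as a remote chat endpoint. A sketch, not Prime Intellect's actual protocol; `model_api` stands in for an OpenAI-compatible HTTP call and is injected as a plain callable so the example stays offline, and `env_step` is a hypothetical action executor.

```python
def split_rollout(task, model_api, env_step, max_turns=20):
    """Environment-side loop for a split architecture: each turn, send the
    transcript to the remote model, execute its reply against the local
    environment, and feed the observation back as the next user message."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model_api(messages)          # network call in production
        messages.append({"role": "assistant", "content": reply})
        done, observation = env_step(reply)  # env executes the action locally
        if done:
            return messages
        messages.append({"role": "user", "content": observation})
    return messages
```

Note what's absent: the trainer. In a split architecture the environment only produces transcripts; gradients are somebody else's process.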
How real systems implement this:
| System | Topology | Notes |
|---|---|---|
| Prime Intellect verifiers | Option C (Split) | Env is a standalone Python package distributed via Environments Hub |
| Tongyi DeepResearch | Option B (Co-located) | Tools, context manager, verifiers inside the training pipeline |
| Step-DeepResearch | Option B (Co-located) | Single-agent ReAct loop embedded in training |
| MiroThinker | Option C (Split) | Tool servers and sandbox run independently from model |
| Tinker API | Option A (Decoupled, cloud) | Model stays remote; researcher sends forward_backward + sample calls via API |
| AutoEnv | Option A (Decoupled) | CoreEnv/ObsEnv/SkinEnv abstraction layers |
| EnterpriseOps-Gym | Option A (Decoupled) | Containerized sandbox accessible via any model API |
The trade-offs:
| Dimension | Model Outside (A/C) | Model Inside (B) |
|---|---|---|
| Flexibility | Swap models easily; env is reusable | Tighter integration |
| Scalability | Scale inference and training independently | Must scale everything together |
| Portability | Env packages are shareable | Env tied to training framework |
| Latency | Network overhead per tool call | No network overhead |
| RL compatibility | Works with any RL trainer | Usually tied to one trainer |
The field has converged on Option C for production training. Prime Intellect’s architecture — environments as standalone installable packages that communicate with models via OpenAI-compatible API endpoints — is becoming the standard. The payoff: environments are publishable, trainer-agnostic, and inference and environment execution can be parallelized across nodes.
Tinker pushes this further by making even the training compute remote. The researcher controls the algorithm and never touches model weights. The environment’s job is purely generating experience.
Practical Decision Framework
When building an RL environment for an LLM agent:
Do you have ground truth answers?
- Yes → Exact-match or code-execution verifier
- No → LLM-as-judge, checklist, or pairwise comparison
How many tool calls per episode?
- < 5 → Single-turn or simple tool-use env, no context management needed
- 5–50 → Multi-turn with basic context management
- 50–600 → Full agentic env with reference-preserving context management
Where should the model live?
- Experimenting across many environments → Model outside (Option A/C), use Prime Intellect Hub
- Tight RL training loop → Model co-located (Option B)
- No GPU access → Tinker API
What reward granularity?
- Simple tasks → Outcome-level
- Long-horizon tasks → Turn-level (Nanbeige4.1) or process rewards (PRIME)
- Open-ended research → Evolving rubrics (RLER) or checklist-style (Step-DR)
How to scale environments?
- Manual curation → High quality, expensive, this is where you start
- AutoEnv → Automated generation at ~$4/env
- AgentScaler → Systematic scaling of heterogeneous simulated environments
- Prime Intellect Hub → 500+ community-contributed environments available now
What’s Coming
Three things to pay attention to.
Environment diversity matters as much as environment quality. AgentScaler’s key finding is that heterogeneity of environments drives capability breadth in ways that simply adding more data from the same distribution cannot. You need more kinds of environments, not just more environments.
Automated environment generation is viable. At $4 per generated environment, cost is no longer the bottleneck. The bottleneck is verifier quality — auto-generated environments with weak reward functions will teach the wrong behaviors at scale. (AutoEnv)
The environment-as-package model is winning. The Prime Intellect Environments Hub is creating a shared ecosystem around RL environments, in the same way PyPI and HuggingFace created ecosystems around code and model weights. Environments published once, consumed by any trainer. This is a significant infrastructure shift.
The model isn’t the only variable. The training ground shapes what the model can become. The task distribution, the harness, the verifier, the state management, the topology — these are the decisions that separate agents that work from agents that demo.
References
- Prime Intellect verifiers library
- AgentScaler: Scaling LLM Agent Training with Automatically Constructed Environments (arXiv 2509.13311)
- AutoEnv: Towards Automated Reinforcement Learning Environment Design (arXiv 2511.19304)
- EnterpriseOps-Gym: A Benchmark for Enterprise Operations Agents (arXiv 2603.13594)
- It’s-a Me, Agentic AI
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
```bibtex
@article{leehanchung,
  author       = {Lee, Hanchung},
  title        = {The Training Grounds: A Taxonomy of RL Environments for LLM Agents},
  year         = {2026},
  month        = {03},
  day          = {21},
  howpublished = {\url{https://leehanchung.github.io}},
  url          = {https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/}
}
```