Agentic AI is a fairly recent development that combines reasoning (OpenAI, 2024) and tool use (Schick et al., 2023) in the same AI model. But an agentic AI system is not just the model: it is also the harness, the environments, the tools, the rewards, the evaluations, and all of the infrastructure that supports them. In this post, let’s use the top-grossing video game franchise, Mario, to explain how agentic AI models are developed, how agent harnesses work, and how reinforcement learning ties everything together. If you survived World 8-4 as a kid, you already have the intuition for building agentic AI systems.

Small Mario: The Base Model

Small Mario is the base pretrained model. He’s just come out of pretraining on a massive corpus of platform game physics. He can walk. He can jump. He can move left and right. These are his base capabilities, the raw knowledge compressed from the training data.

But Small Mario is fragile. One hit from a Goomba and he’s dead. He can’t break bricks. He can’t take a hit. He has potential, but he’s not yet useful for anything beyond the most trivial tasks.

This is your base LLM fresh off pretraining. It has absorbed enormous amounts of knowledge, it can do next-token prediction, and it can sort of follow instructions. But ask it to do anything real, reliably, in production, and it falls apart at the first obstacle. One Goomba and it’s game over.

The Super Mushroom: Model Harness

Then Mario finds the Super Mushroom. He doubles in size. He can now break bricks. He can take a hit without dying. He goes from fragile to capable.

The Super Mushroom is the model harness, the entire infrastructure and post-training pipeline that transforms a base model into something production-ready. This includes:

  • Instruction tuning (RLHF, DPO, etc.) to make the model actually follow directions
  • System prompts that define personality and constraints
  • Safety guardrails so it doesn’t run off cliffs
  • Memory and context management so it remembers where it’s been
  • Tool-use training so it knows power-ups exist and how to grab them

Without the Super Mushroom, Mario is a liability. With it, he’s a platform for greatness. Similarly, without the model harness, a base LLM is a research artifact. With it, it’s a product.

The Super Mushroom doesn’t change WHO Mario is. It changes what he can SURVIVE. The model harness doesn’t change the model’s core knowledge. It changes what the model can handle in production.

Power-Ups: Agent Skills and Tools

Now here’s where it gets interesting. Super Mario can pick up power-ups that give him entirely new capabilities:

| Power-Up | Mario Ability | Agent Equivalent |
|---|---|---|
| Fire Flower | Throw fireballs to defeat enemies at range | Code execution - reach out and run code to solve problems the model can’t solve with text alone |
| Frog Suit | Swim through water levels with ease | Web search - navigate and retrieve information from an environment the model wasn’t trained on |
| Raccoon Leaf | Fly and reach otherwise inaccessible areas | File system access - explore and modify the codebase, reach files the model couldn’t otherwise touch |
| Hammer Suit | Throw hammers, defeat almost anything | Shell/terminal access - the most powerful tool, can interact with the full system |
| Star | Temporary invincibility, blow through everything | Extended thinking/reasoning mode - brute force through complex problems at higher compute cost |
| Cape Feather | Sustained flight and spin attack | MCP servers - sustained, extensible access to external services and APIs |

Each power-up doesn’t replace Mario’s core abilities. Mario still walks and jumps. The power-ups extend what he can do. A Fire Flower Mario can still jump on Goombas, but now he can also shoot fireballs at Piranha Plants hiding in pipes.

This is exactly how agent tools work. The LLM still does what LLMs do: reasoning, language understanding, and planning. Tools extend the model’s reach into environments it can’t operate in alone. An LLM can’t execute Python by itself, just like Mario can’t throw fireballs without a Fire Flower. But give it the right tool, and suddenly the problem space opens up.

And critically, Mario has to learn WHEN to use each power-up. Frog Suit is amazing in water levels, useless on land. Fire Flower is great against Goombas, pointless against Thwomps. The model needs to learn tool selection, knowing which tool to reach for in which context. This is one of the hardest parts of building agentic systems.
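The tool-selection idea can be sketched in a few lines. This is a toy, assuming a hypothetical tool registry with two made-up handlers (`run_python`, `web_search`); in a real agent the model itself chooses the tool from the descriptions, and here a trivial keyword heuristic stands in for that learned policy.

```python
# Hypothetical tool handlers; stand-ins for a sandboxed interpreter
# and a search backend.
def run_python(code: str) -> str:
    return f"executed: {code}"

def web_search(query: str) -> str:
    return f"results for: {query}"

# Each tool carries a description the model reads when deciding what
# to grab, just as Mario learns what each power-up is for.
TOOLS = {
    "run_python": ("Execute Python code to compute answers", run_python),
    "web_search": ("Look up facts the model was not trained on", web_search),
}

def select_tool(task: str) -> str:
    """Naive keyword stand-in for the learned tool-selection policy."""
    if "compute" in task or "calculate" in task:
        return "run_python"
    return "web_search"

task = "calculate the 10th Fibonacci number"
name = select_tool(task)
_, handler = TOOLS[name]
print(name)            # which power-up the agent grabbed
print(handler(task))
```

The interesting engineering is entirely in `select_tool`: swap the keyword heuristic for a model trained with the RL loop described later, and you have the core of an agent.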

One power-up deserves special attention: the Star. When Mario grabs a Star, he becomes invincible. He plows through Goombas, Koopa Troopas, Piranha Plants, everything in his path just disintegrates. Nothing can stop him. This is not dissimilar to having an engineering manager who’s really good at clearing organizational blockers for their engineers. The Goombas and Piranha Plants of bureaucracy, cross-team dependencies, access requests, and priority conflicts just melt away. Star power is temporary and expensive, but when you need to blast through a critical path, nothing else comes close.

The Mushroom Kingdom: Environments and Tasks

Now let’s talk about the world Mario operates in. Every level in the Mushroom Kingdom is an environment, and every environment is composed of the same building blocks:

| Level Element | Environment Equivalent |
|---|---|
| Bricks | Structured data, APIs, databases that can be broken open to reveal resources |
| ? Blocks | Unknown information sources, sometimes containing exactly what you need |
| Pipes | Entry points to sub-tasks, function calls, or deeper exploration |
| Coins | Incremental progress signals, intermediate rewards |
| Goombas | Common obstacles, predictable errors, edge cases |
| Koopa Troopas | Recyclable obstacles, errors whose solutions can be reused as tools |
| Piranha Plants | Traps hiding in seemingly safe passages, hallucination triggers |
| Vines | Hidden paths to bonus areas, unexpected shortcuts |
| Clouds / Platforms | Abstractions and scaffolding that support traversal |
| Pits | Catastrophic failures, unrecoverable errors |

Every level remixes these elements differently. World 1-1 is simple, a few Goombas, some bricks, a clear path to the flag. World 8-4 is a maze of pipes, hidden paths, and a boss fight with Bowser. Same building blocks, radically different difficulty.

Throughout each level, Mario interacts with the environment as tool use. He enters pipes to warp from one place to another, the equivalent of an API call that transports you to an entirely different context. He bumps ? Blocks from below to discover power-ups, new agent skills materializing from the environment when you know where to look. He breaks bricks to clear paths or reveal hidden rewards, structured data yielding its value when you apply force in the right direction. He stomps on a Koopa shell and kicks it forward, turning an obstacle into a projectile that clears a line of Goombas, repurposing error outputs as inputs to solve downstream problems. The environment isn’t just a backdrop. It’s a toolbox.
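A level as an environment can be sketched with a Gym-style observe/step interface. Everything here is illustrative, not a real training environment: the element names mirror the table above, and the layout, actions, and reward values are made up to keep the arithmetic visible.

```python
class Level:
    """Toy level: a strip of elements Mario walks through left to right."""

    def __init__(self, layout):
        self.layout = list(layout)   # e.g. ["coin", "goomba", "pipe", "flag"]
        self.pos = 0

    def observe(self):
        """Current state: what is directly ahead of Mario."""
        return self.layout[self.pos] if self.pos < len(self.layout) else "flag"

    def step(self, action):
        """Apply an action; return (observation, reward, done)."""
        ahead = self.observe()
        reward = 0
        if ahead == "coin" and action == "grab":
            reward = 1                   # incremental progress signal
        elif ahead == "goomba" and action != "jump":
            return ahead, -10, True      # the pit: catastrophic failure
        self.pos += 1
        done = self.pos >= len(self.layout)
        return self.observe(), reward, done

# A hand-written policy playing one episode.
env = Level(["coin", "goomba", "pipe", "flag"])
obs, total = env.observe(), 0
done = False
while not done:
    action = "grab" if obs == "coin" else "jump"
    obs, reward, done = env.step(action)
    total += reward
print(total)   # 1
```

The point of the interface is that the agent only ever sees observations and rewards; pipes, bricks, and shells are just actions against `step`.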

Objectives: What Are We Playing For?

Having an environment is not enough. We need to define what we want to achieve from playing the game. These are the tasks.

Mario can play the same level with completely different objectives. Complete the level. Complete the level in the shortest amount of time. Get the highest score. Collect the most coins. Accumulate the most 1-up lives. Find all the hidden rewards. Stomp on every last Goomba. Each objective produces a fundamentally different play style from the same environment.

This is exactly the task definition problem in agentic AI. The same model operating in the same environment can be pointed at wildly different goals. “Summarize this codebase” and “refactor this codebase” use the same files, the same tools, the same context, but they require entirely different strategies. The task is what transforms an environment from a sandbox into a mission.
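One way to make the task-definition idea concrete is a small sketch: the same environment and tools, pointed at different objectives, yield different strategies. The `Task` shape and the objective-to-strategy table are purely illustrative names, not a real framework.

```python
from dataclasses import dataclass

@dataclass
class Task:
    goal: str          # what "done" means
    objective: str     # what the reward favors

# Hypothetical mapping from objective to play style / agent strategy.
STRATEGY = {
    "speed": "skip coins, take warp pipes, run",
    "score": "stomp every Goomba, grab every coin",
    "coverage": "explore hidden blocks and bonus areas",
}

def plan(task: Task) -> str:
    """Same environment, same tools; the objective picks the strategy."""
    return STRATEGY[task.objective]

print(plan(Task("reach the flagpole", "speed")))
print(plan(Task("reach the flagpole", "score")))   # same goal, new play style
```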

The Flagpole: Evaluations and Reward Modeling

At the end of every level, there’s a flagpole. Mario jumps on it, pulls down the flag, and receives a reward. The higher he grabs the flag, the bigger the reward. Some levels end with a boss fight against Bowser, where the reward is freeing a Toad (or eventually, the Princess).

This is the reward signal in reinforcement learning. But here’s the critical question: how do we actually measure how well the game was played? This is evaluation, or reward modeling, and it is where the machine learning engineer (or research engineer, or applied AI engineer) really shines.

The evaluation could be the raw score. It could be the number of 1-ups gained. Coins collected. Goombas stomped. Time remaining. Different paths discovered. Most frequently, it is a combination of some or all of the above, weighted and balanced against each other. Do we reward Mario more for speed or for thoroughness? For survival or for aggression? For finding secrets or for staying on the critical path?

| Evaluation Metric | Mario Measure | Agent Measure |
|---|---|---|
| Speed | Time remaining on the clock | Task completion latency |
| Score | Points accumulated | Overall output quality |
| Collection | Coins gathered | Information retrieved, resources used efficiently |
| Completeness | Hidden blocks found, secrets discovered | Edge cases handled, comprehensive coverage |
| Efficiency | Enemies defeated per life | Correct tool invocations per task |
| Exploration | Different paths taken | Novel approaches discovered |

Designing these rewards is a core machine learning engineering discipline. A poorly shaped reward function produces an agent that technically completes tasks but in degenerate ways, like a Mario speedrunner who clips through walls. Impressive, but not what we actually wanted. Reward hacking is the Goodhart’s Law of agentic AI: when a measure becomes a target, it ceases to be a good measure.
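The weighting-and-balancing question above is easy to sketch. The metric names, episode values, and weights here are all made up for illustration; the point is that the same episode scores very differently depending on how the reward designer tips the scales.

```python
def reward(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-episode evaluation metrics."""
    return sum(weights.get(name, 0.0) * value for name, value in metrics.items())

# One episode's raw measurements (hypothetical).
episode = {"time_remaining": 120, "coins": 14, "secrets_found": 2, "deaths": 1}

# Two reward designs for the same game.
speedrun = {"time_remaining": 1.0, "deaths": -50.0}            # speed above all
completionist = {"coins": 1.0, "secrets_found": 25.0, "deaths": -10.0}

print(reward(episode, speedrun))        # 70.0
print(reward(episode, completionist))   # 54.0
```

Skew the weights far enough toward one metric and you have built the wall-clipping speedrunner: the reward function, not the task description, is what the agent actually optimizes.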

Learning to Play: Reinforcement Learning

Here’s where the full picture comes together. How does Mario learn to complete levels? Reinforcement learning.

Mario starts each level knowing nothing about its specific layout. He has to:

  1. Observe the current state, what’s on screen, where the enemies are, what power-ups are available
  2. Decide on an action based on his policy, jump, run, shoot, or wait
  3. Act and receive feedback from the environment
  4. Update his policy based on the outcome

This is the MDP (Markov Decision Process) loop. The same loop described in Agents Are Workflows. The same loop that every agentic AI system runs:

\[v^\pi(s) = \mathop{\mathbb{E}}_{a \sim \pi}\left[r(s, a) + \gamma\, v^\pi(s^\prime)\right]\]

The value of Mario’s current state equals the expected immediate reward plus the discounted value of the next state. Should Mario jump NOW to get the coin, or wait and avoid the Goomba? The optimal policy $\pi^*$ balances immediate rewards against future outcomes.
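The jump-now-or-wait tradeoff can be worked through on a tiny example: a three-state strip where each state offers a "jump" or "wait" action. The states, rewards, and discount below are invented purely to make the arithmetic concrete; this computes the optimal value $\max_a [r(s,a) + \gamma v(s')]$ by recursion over a finite, deterministic toy MDP.

```python
GAMMA = 0.9  # discount factor: how much future reward is worth today

# Deterministic toy MDP: state -> action -> (reward, next_state).
MDP = {
    "start": {"jump": (1, "mid"), "wait": (0, "mid")},     # coin if we jump
    "mid":   {"jump": (-10, "end"), "wait": (0, "end")},   # Goomba overhead
    "end":   {},                                           # flagpole: terminal
}

def best_value(state: str) -> float:
    """Optimal value: max over actions of reward plus discounted next value."""
    if not MDP[state]:
        return 0.0
    return max(r + GAMMA * best_value(s2) for r, s2 in MDP[state].values())

print(best_value("start"))   # 1.0: jump for the coin now, wait past the Goomba
```

Real agents face the same equation with enormous state spaces and stochastic transitions, which is why the policy is learned from episodes rather than computed by recursion, but the tradeoff being optimized is exactly this one.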

Through repeated play (training episodes), Mario learns:

  • Which obstacles can be jumped on vs. avoided
  • When to use power-ups vs. save them
  • Which pipes lead to shortcuts vs. dead ends
  • How to handle boss fights

An agentic AI model goes through the same process. Through reinforcement learning (PPO, DPO, GRPO, or whatever the latest acronym is), the model learns:

  • Which tools to invoke for which subtasks
  • When to think longer vs. act immediately
  • Which approaches work for which problem types
  • How to decompose complex tasks into manageable steps

Behind the Controller: The Engineers

So who’s actually making all of this work? Mario doesn’t train himself.

To teach agent Mario to be really good at the game, we employ a Machine Learning Engineer (MLE). The MLE is the game designer and coach rolled into one. They construct mock environments, set up the harnesses and tools, define the tasks that Mario needs to achieve, and most importantly, design the reward function. The MLE decides what “good” looks like. Do we reward Mario for speed? Thoroughness? Both? How much? This reward design is the single highest leverage decision in the entire pipeline. Get it right and Mario learns to play beautifully. Get it wrong and Mario learns to exploit glitches.

Once the MLE has designed the training setup, the Machine Learning Systems Engineers (MLSys) take over. They are the ones who actually run the show at scale. They stand up agent Mario across hundreds to hundreds of thousands of environments, tasks, and iterations. They manage the compute, the distributed training runs, the data pipelines. They collect the reasoning traces, the sequences of observations, actions, and outcomes from every single episode Mario plays. And from these traces, they run the reinforcement learning algorithms that let agent Mario learn from experience.

This is the unsexy but critical part. An MLE can design the most elegant reward function in the world, but without MLSys engineers standing up the infrastructure to run millions of training episodes and collect the resulting data, Mario never gets past World 1-1.

Worlds: Agent Frameworks and Domains

The Mushroom Kingdom isn’t one level. It’s organized into Worlds, each with a distinct theme and set of challenges:

| World | Theme | Agent Domain |
|---|---|---|
| World 1 | Grassland, basics | Simple text tasks, Q&A, summarization |
| World 2 | Desert | Data analysis, structured environments |
| World 3 | Water | Web and API navigation |
| World 4 | Giant Land | Large-scale refactoring, migrations |
| World 5 | Sky | Architecture and system design |
| World 6 | Ice | Debugging in fragile or legacy environments |
| World 7 | Pipe Maze | Complex multi-step workflows with dependencies |
| World 8 | Bowser’s Castle | Full autonomous task completion with adversarial conditions |

Agent frameworks are the World Maps: they organize how the agent navigates between levels (subtasks) and worlds (domains). Some frameworks are rigid: you follow the path laid out. Others let you pick your route. The best ones (like the overworld map Super Mario Bros. 3 introduced) let the agent make strategic choices about which level to tackle next.

But remember: the model is the product. The World Map (framework) is useful scaffolding, but Mario does the actual playing. As models get stronger, they need less scaffolding. Today’s World 8 is tomorrow’s World 1.

Conclusion

The entire agentic AI stack maps to Mario with surprising fidelity:

| Mario | Agentic AI |
|---|---|
| Small Mario | Base pretrained model |
| Super Mushroom | Model harness and post-training |
| Power-ups | Agent tools and skills |
| Star Power | Engineering manager clearing blockers |
| Levels | Task environments |
| Pipes, bricks, shells | Tool use within the environment |
| Objectives (speed, score, coins) | Task definitions |
| Flagpole / Boss | Reward signal and evaluation |
| Reward design | Reward modeling (ML engineering discipline) |
| Learning to play | Reinforcement learning |
| MLE as game designer | ML engineer designing tasks and rewards |
| MLSys running millions of episodes | ML infrastructure for training at scale |
| World Map | Agent framework |
| Multiplayer | Multi-agent systems |

The next time someone asks you to explain agentic AI, ask them: have you played Mario?

If they can understand that Small Mario needs a mushroom to survive, power-ups to be effective, and practice to master each level, they can understand agentic AI development.

Now go save Princess Peach.

References

@misc{leehanchung2026mario,
    author = {Lee, Hanchung},
    title = {It's-a Me, Agentic AI},
    year = {2026},
    month = {02},
    day = {18},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2026/02/18/mario-agentic-ai/}
}