It's-a Me, Agentic AI
Agentic AI is a fairly recent development that combines reasoning (OpenAI, 2024) and tool use (Schick et al., 2023) in the same AI model. But an agentic AI system is not just the model: it is also the harness, the environments, the tools, the rewards, the evaluations, and all of the infrastructure that supports them. In this post, let’s use the top-grossing video game franchise, Mario, to tell the story of how agentic AI models are developed, how agent harnesses work, and how reinforcement learning ties everything together. If you survived World 8-4 as a kid, you already have the intuition for building agentic AI systems.

Small Mario: The Base Model
Small Mario is the base pretrained model. He’s just come out of pretraining on a massive corpus of platform game physics. He can walk. He can jump. He can move left and right. These are his base capabilities, the raw knowledge compressed from the training data.
But Small Mario is fragile. One hit from a Goomba and he’s dead. He can’t break bricks. He can’t take damage. He has potential, but he’s not yet useful for anything beyond the most trivial tasks.
This is your base LLM fresh off pretraining. It has absorbed enormous amounts of knowledge, it can do next-token prediction, and it can sort of follow instructions. But ask it to do anything real, reliably, in production, and it falls apart on the first obstacle. One Goomba and it’s game over.
The Super Mushroom: Model Harness
Then Mario finds the Super Mushroom. He doubles in size. He can now break bricks. He can take a hit without dying. He goes from fragile to capable.
The Super Mushroom is the model harness, the entire infrastructure and post-training pipeline that transforms a base model into something production-ready. This includes:
- Instruction tuning (RLHF, DPO, etc.) to make the model actually follow directions
- System prompts that define personality and constraints
- Safety guardrails so it doesn’t run off cliffs
- Memory and context management so it remembers where it’s been
- Tool-use training so it knows power-ups exist and how to grab them
Without the Super Mushroom, Mario is a liability. With it, he’s a platform for greatness. Similarly, without the model harness, a base LLM is a research artifact. With it, it’s a product.
The Super Mushroom doesn’t change WHO Mario is. It changes what he can SURVIVE. The model harness doesn’t change the model’s core knowledge. It changes what the model can handle in production.
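In code, the harness idea looks roughly like this: a thin layer that wraps every raw model call with a system prompt, a guardrail check, and rolling context memory. Everything here (`HarnessedModel`, the blocklist, the echo model) is an illustrative sketch, not a real API.

```python
# Minimal sketch of a model harness: a system prompt, a toy safety
# guardrail, and context memory wrapped around a raw model call.
# All names here are hypothetical illustrations.

SYSTEM_PROMPT = "You are a helpful assistant. Refuse unsafe requests."
BLOCKLIST = {"rm -rf /"}  # toy guardrail: block obviously destructive input

class HarnessedModel:
    def __init__(self, base_model, max_turns=10):
        self.base_model = base_model  # any callable: prompt -> text
        self.history = []             # context memory across turns
        self.max_turns = max_turns

    def __call__(self, user_msg):
        # Guardrail fires before the model ever sees the input.
        if any(bad in user_msg for bad in BLOCKLIST):
            return "Request blocked by guardrail."
        self.history.append(("user", user_msg))
        # Keep only the most recent turns in context.
        recent = self.history[-self.max_turns:]
        prompt = SYSTEM_PROMPT + "\n" + "\n".join(f"{r}: {m}" for r, m in recent)
        reply = self.base_model(prompt)
        self.history.append(("assistant", reply))
        return reply

# Toy "base model" that echoes the last line of its prompt.
echo = lambda prompt: "echo: " + prompt.splitlines()[-1]
mario = HarnessedModel(echo)
print(mario("hello"))     # routed through system prompt + memory
print(mario("rm -rf /"))  # blocked before reaching the model
```

Same base model underneath; the harness only changes what it can survive.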
Power-Ups: Agent Skills and Tools
Now here’s where it gets interesting. Super Mario can pick up power-ups that give him entirely new capabilities:
| Power-Up | Mario Ability | Agent Equivalent |
|---|---|---|
| Fire Flower | Throw fireballs to defeat enemies at range | Code execution - reach out and run code to solve problems the model can’t solve with text alone |
| Frog Suit | Swim through water levels with ease | Web search - navigate and retrieve information from an environment the model wasn’t trained on |
| Raccoon Leaf | Fly and reach otherwise inaccessible areas | File system access - explore and modify the codebase, reach files the model couldn’t otherwise touch |
| Hammer Suit | Throw hammers, defeat almost anything | Shell/terminal access - the most powerful tool, can interact with the full system |
| Star | Temporary invincibility, blow through everything | Extended thinking/reasoning mode - brute force through complex problems at higher compute cost |
| Cape Feather | Sustained flight and spin attack | MCP servers - sustained, extensible access to external services and APIs |
Each power-up doesn’t replace Mario’s core abilities. Mario still walks and jumps. The power-ups extend what he can do. A Fire Flower Mario can still jump on Goombas, but now he can also shoot fireballs at Piranha Plants hiding in pipes.
This is exactly how agent tools work. The LLM still does what LLMs do: reasoning, language understanding, and planning. Tools extend the model’s reach into environments it can’t operate in alone. An LLM can’t execute Python by itself, just like Mario can’t throw fireballs without a Fire Flower. But give it the right tool, and suddenly the problem space opens up.
And critically, Mario has to learn WHEN to use each power-up. Frog Suit is amazing in water levels, useless on land. Fire Flower is great against Goombas, pointless against Thwomps. The model needs to learn tool selection, knowing which tool to reach for in which context. This is one of the hardest parts of building agentic systems.
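Tool selection can be sketched as a registry plus a matching policy. In a real system the model learns this mapping during training; the keyword-overlap scoring below is just a toy stand-in, and the tool names are illustrative.

```python
# Sketch of tool selection: pick the tool whose strengths best match the
# task, the way the Frog Suit only helps in water levels. The registry
# and the keyword "policy" are illustrative, not a real framework.

TOOLS = {
    "code_execution": {"good_for": {"compute", "math", "data"}},
    "web_search":     {"good_for": {"lookup", "news", "docs"}},
    "file_system":    {"good_for": {"read", "write", "refactor"}},
}

def select_tool(task_tags):
    """Score each tool by overlap with the task's tags; None means no fit."""
    scores = {
        name: len(spec["good_for"] & task_tags)
        for name, spec in TOOLS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no tool fits: answer in text

print(select_tool({"math", "compute"}))  # code_execution
print(select_tool({"news"}))             # web_search
print(select_tool({"poetry"}))           # None: Fire Flower won't help here
```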
One power-up deserves special attention: the Star. When Mario grabs a Star, he becomes invincible. He plows through Goombas, Koopa Troopas, Piranha Plants, everything in his path just disintegrates. Nothing can stop him. This is not dissimilar to having an engineering manager who’s really good at clearing organizational blockers for their engineers. The Goombas and Piranha Plants of bureaucracy, cross-team dependencies, access requests, and priority conflicts just melt away. Star power is temporary and expensive, but when you need to blast through a critical path, nothing else comes close.
The Mushroom Kingdom: Environments and Tasks
Now let’s talk about the world Mario operates in. Every level in the Mushroom Kingdom is an environment, and every environment is composed of the same building blocks:
| Level Element | Environment Equivalent |
|---|---|
| Bricks | Structured data, APIs, databases that can be broken open to reveal resources |
| ? Blocks | Unknown information sources, sometimes containing exactly what you need |
| Pipes | Entry points to sub-tasks, function calls, or deeper exploration |
| Coins | Incremental progress signals, intermediate rewards |
| Goombas | Common obstacles, predictable errors, edge cases |
| Koopa Troopas | Recyclable obstacles, errors whose solutions can be reused as tools |
| Piranha Plants | Traps hiding in seemingly safe passages, hallucination triggers |
| Vines | Hidden paths to bonus areas, unexpected shortcuts |
| Clouds / Platforms | Abstractions and scaffolding that support traversal |
| Pits | Catastrophic failures, unrecoverable errors |
Every level remixes these elements differently. World 1-1 is simple: a few Goombas, some bricks, a clear path to the flag. World 8-4 is a maze of pipes, hidden paths, and a boss fight with Bowser. Same building blocks, radically different difficulty.
Throughout each level, Mario interacts with the environment as tool use. He enters pipes to warp from one place to another, the equivalent of an API call that transports you to an entirely different context. He bumps ? Blocks from below to discover power-ups, new agent skills materializing from the environment when you know where to look. He breaks bricks to clear paths or reveal hidden rewards, structured data yielding its value when you apply force in the right direction. He stomps on a Koopa shell and kicks it forward, turning an obstacle into a projectile that clears a line of Goombas, repurposing error outputs as inputs to solve downstream problems. The environment isn’t just a backdrop. It’s a toolbox.
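The environment-as-toolbox idea can be written gym-style, with each level element from the table above exposed as an action the agent can take on the current state. The `Level` class and its methods are a hypothetical illustration, not a real environment API.

```python
# Toy environment where level elements are callable actions. Each method
# mirrors a row of the table above; names are illustrative.

class Level:
    def __init__(self):
        self.coins = 0
        self.revealed = []

    def enter_pipe(self, destination):
        # Like an API call: transports you to an entirely different context.
        return f"warped to {destination}"

    def bump_block(self):
        # Unknown information source: may yield a power-up (new skill).
        self.revealed.append("power-up")
        return self.revealed[-1]

    def break_brick(self):
        # Structured data yielding value when you apply force.
        self.coins += 1
        return self.coins

env = Level()
print(env.enter_pipe("bonus room"))  # warped to bonus room
print(env.bump_block())              # power-up
print(env.break_brick())             # 1 coin: incremental progress signal
```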
Objectives: What Are We Playing For?
Having an environment is not enough. We need to define what we want to achieve from playing the game. These are the tasks.
Mario can play the same level with completely different objectives. Complete the level. Complete the level in the shortest amount of time. Get the highest score. Collect the most coins. Accumulate the most 1-up lives. Find all the hidden rewards. Stomp on every last Goomba. Each objective produces a fundamentally different play style from the same environment.
This is exactly the task definition problem in agentic AI. The same model operating in the same environment can be pointed at wildly different goals. “Summarize this codebase” and “refactor this codebase” use the same files, the same tools, the same context, but they require entirely different strategies. The task is what transforms an environment from a sandbox into a mission.
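In this framing, a task is the environment plus a goal and a success check. The `Task` dataclass and the toy codebase state below are illustrative assumptions, not a real framework; the point is that the same state satisfies one objective and fails the other.

```python
# A task = a goal plus a way to judge the final state. Two tasks over the
# same "codebase" environment require entirely different behavior.
# Task and the state dict are hypothetical sketches.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    goal: str                        # natural-language objective
    success: Callable[[dict], bool]  # how we judge the final state

state = {"files_read": 12, "files_changed": 0}

summarize = Task("Summarize this codebase",
                 success=lambda s: s["files_read"] > 0 and s["files_changed"] == 0)
refactor = Task("Refactor this codebase",
                success=lambda s: s["files_changed"] > 0)

print(summarize.success(state))  # True: read everything, changed nothing
print(refactor.success(state))   # False: no edits yet
```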
The Flagpole: Evaluations and Reward Modeling
At the end of every level, there’s a flagpole. Mario jumps on it, pulls down the flag, and receives a reward. The higher he grabs the flag, the bigger the reward. Some levels end with a boss fight against Bowser, where the reward is freeing a Toad (or eventually, the Princess).
This is the reward signal in reinforcement learning. But here’s the critical question: how do we actually measure how well the game was played? This is evaluation, or reward modeling, and it is where the machine learning engineer (or research engineer, or applied AI engineer) really shines.
The evaluation could be the raw score. It could be the number of 1-ups gained. Coins collected. Goombas stomped. Time remaining. Different paths discovered. Most frequently, it is a combination of some or all of the above, weighted and balanced against each other. Do we reward Mario more for speed or for thoroughness? For survival or for aggression? For finding secrets or for staying on the critical path?
| Evaluation Metric | Mario Measure | Agent Measure |
|---|---|---|
| Speed | Time remaining on the clock | Task completion latency |
| Score | Points accumulated | Overall output quality |
| Collection | Coins gathered | Information retrieved, resources used efficiently |
| Completeness | Hidden blocks found, secrets discovered | Edge cases handled, comprehensive coverage |
| Efficiency | Enemies defeated per life | Correct tool invocations per task |
| Exploration | Different paths taken | Novel approaches discovered |
Designing these rewards is a rigorous machine learning engineering discipline. A poorly shaped reward function produces an agent that technically completes tasks, but in degenerate ways, like a Mario speedrunner who clips through walls. Impressive, but not what we actually wanted. Reward hacking is Goodhart’s Law applied to agentic AI: when a measure becomes a target, it ceases to be a good measure.
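A common (and deliberately simplistic) way to combine these metrics is a weighted sum over normalized episode scores. The weights below are invented for illustration; choosing them well is exactly the reward-shaping problem described above, and skewing them all toward one metric is exactly how you breed a wall-clipping speedrunner.

```python
# Weighted-sum reward over the metrics from the table above. The metric
# values and weights are made up for illustration.

def reward(episode, weights):
    """Combine normalized episode metrics into a single scalar reward."""
    return sum(weights[k] * episode[k] for k in weights)

episode = {
    "speed": 0.9,         # normalized time remaining
    "score": 0.5,         # normalized points
    "completeness": 0.2,  # fraction of secrets found
}

# Reward only speed, and degenerate play styles start to win.
speedrun_weights = {"speed": 1.0, "score": 0.0, "completeness": 0.0}
balanced_weights = {"speed": 0.4, "score": 0.3, "completeness": 0.3}

print(reward(episode, speedrun_weights))  # 0.9: wall clips welcome
print(reward(episode, balanced_weights))  # ~0.57: thoroughness counts too
```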
Learning to Play: Reinforcement Learning
Here’s where the full picture comes together. How does Mario learn to complete levels? Reinforcement learning.
Mario starts each level knowing nothing about its specific layout. He has to:
- Observe the current state, what’s on screen, where the enemies are, what power-ups are available
- Decide on an action based on his policy, jump, run, shoot, or wait
- Act and receive feedback from the environment
- Update his policy based on the outcome
This is the MDP (Markov Decision Process) loop. The same loop described in Agents Are Workflows. The same loop that every agentic AI system runs:
\[v^\pi(s) = \mathop{\mathbb{E}}\left[r(s, a) + \gamma\, v^\pi(s^\prime)\right]\]

The value of Mario’s current state equals the expected immediate reward plus the discounted value of the next state. Should Mario jump NOW to get the coin, or wait and avoid the Goomba? The optimal policy $\pi^*$ balances immediate rewards against future outcomes.
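A tiny numeric instance of this equation: tabular TD(0) learning on a two-state toy level, where the value of the state before a coin converges toward the reward for grabbing it. All numbers here are illustrative.

```python
# Tabular TD(0): nudge v(s) toward the target r + gamma * v(s'),
# which is the value equation above applied one sample at a time.

gamma = 0.9   # discount factor
alpha = 0.5   # learning rate
v = {"before_coin": 0.0, "after_coin": 0.0}

# Repeated episode: jumping from "before_coin" yields reward 1 and lands
# in "after_coin", which is terminal (value stays 0).
for _ in range(50):
    s, r, s_next = "before_coin", 1.0, "after_coin"
    td_target = r + gamma * v[s_next]   # expected reward + discounted future
    v[s] += alpha * (td_target - v[s])  # move the estimate toward the target

print(round(v["before_coin"], 3))  # converges toward 1.0
```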
Through repeated play (training episodes), Mario learns:
- Which obstacles can be jumped on vs. avoided
- When to use power-ups vs. save them
- Which pipes lead to shortcuts vs. dead ends
- How to handle boss fights
An agentic AI model goes through the same process. Through reinforcement learning (PPO, DPO, GRPO, or whatever the latest acronym is), the model learns:
- Which tools to invoke for which subtasks
- When to think longer vs. act immediately
- Which approaches work for which problem types
- How to decompose complex tasks into manageable steps
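Both lists describe the same loop. Stripped to its skeleton, with a trivial policy standing in for the model and a one-dimensional level standing in for the environment, it might look like this sketch:

```python
# The observe -> decide -> act -> update loop shared by Mario and every
# agentic system. run_agent, env_step, and the policy are illustrative.

def run_agent(policy, env_step, state, max_steps=10):
    trace = []
    for _ in range(max_steps):
        action = policy(state)                          # decide from observation
        state, reward, done = env_step(state, action)   # act, receive feedback
        trace.append((action, reward))                  # traces feed RL updates
        if done:
            break
    return state, trace

# Toy environment: reach the flagpole at position 3.
def env_step(pos, action):
    pos = pos + 1 if action == "run_right" else pos
    done = pos >= 3
    return pos, (1.0 if done else 0.0), done

policy = lambda pos: "run_right"  # stand-in for the learned policy
final, trace = run_agent(policy, env_step, 0)
print(final, len(trace))  # 3 3: reached the flag in three steps
```

In training, the collected traces are what the RL algorithm consumes to improve the policy between episodes.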
Behind the Controller: The Engineers
So who’s actually making all of this work? Mario doesn’t train himself.
To teach agent Mario to be really good at the game, we employ a Machine Learning Engineer (MLE). The MLE is the game designer and coach rolled into one. They construct mock environments, set up the harnesses and tools, define the tasks that Mario needs to achieve, and most importantly, design the reward function. The MLE decides what “good” looks like. Do we reward Mario for speed? Thoroughness? Both? How much? This reward design is the single highest leverage decision in the entire pipeline. Get it right and Mario learns to play beautifully. Get it wrong and Mario learns to exploit glitches.
Once the MLE has designed the training setup, the Machine Learning Systems Engineers (MLSys) take over. They are the ones who actually run the show at scale. They stand up agent Mario across hundreds to hundreds of thousands of environments, tasks, and iterations. They manage the compute, the distributed training runs, the data pipelines. They collect the reasoning traces, the sequences of observations, actions, and outcomes from every single episode Mario plays. And from these traces, they run the reinforcement learning algorithms that allow agent Mario to learn from experience.
This is the unsexy but critical part. An MLE can design the most elegant reward function in the world, but without MLSys engineers standing up the infrastructure to run millions of training episodes and collect the resulting data, Mario never gets past World 1-1.
Worlds: Agent Frameworks and Domains
The Mushroom Kingdom isn’t one level. It’s organized into Worlds, each with a distinct theme and set of challenges:
| World | Theme | Agent Domain |
|---|---|---|
| World 1 | Grassland, basics | Simple text tasks, Q&A, summarization |
| World 2 | Desert | Data analysis, structured environments |
| World 3 | Water | Web and API navigation |
| World 4 | Giant Land | Large-scale refactoring, migrations |
| World 5 | Sky | Architecture and system design |
| World 6 | Ice | Debugging in fragile or legacy environments |
| World 7 | Pipe Maze | Complex multi-step workflows with dependencies |
| World 8 | Bowser’s Castle | Full autonomous task completion with adversarial conditions |
Agent frameworks are the World Maps: they organize how the agent navigates between levels (subtasks) and worlds (domains). Some frameworks are rigid: you follow the path laid out. Others let you pick your route. The best ones (like the overworld map that Super Mario Bros. 3 introduced) let the agent make strategic choices about which level to tackle next.
But remember: the model is the product. The World Map (framework) is useful scaffolding, but Mario does the actual playing. As models get stronger, they need less scaffolding. Today’s World 8 is tomorrow’s World 1.
Conclusion
The entire agentic AI stack maps to Mario with surprising fidelity:
| Mario | Agentic AI |
|---|---|
| Small Mario | Base pretrained model |
| Super Mushroom | Model harness and post-training |
| Power-ups | Agent tools and skills |
| Star Power | Engineering manager clearing blockers |
| Levels | Task environments |
| Pipes, bricks, shells | Tool use within the environment |
| Objectives (speed, score, coins) | Task definitions |
| Flagpole / Boss | Reward signal and evaluation |
| Reward design | Reward modeling (ML engineering discipline) |
| Learning to play | Reinforcement learning |
| MLE as game designer | ML engineer designing tasks and rewards |
| MLSys running millions of episodes | ML infrastructure for training at scale |
| World Map | Agent framework |
| Multiplayer | Multi-agent systems |
The next time someone asks you to explain agentic AI, ask them: have you played Mario?
If they can understand that Small Mario needs a mushroom to survive, power-ups to be effective, and practice to master each level, they can understand agentic AI development.
Now go save Princess Peach.
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
- Agents Are Workflows
- No Code, Low Code, Real Code
- OpenAI (2024). OpenAI o1 System Card.
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools.
@article{leehanchung,
  author = {Lee, Hanchung},
  title = {It's-a Me, Agentic AI},
  year = {2026},
  month = {02},
  day = {18},
  howpublished = {\url{https://leehanchung.github.io}},
  url = {https://leehanchung.github.io/blogs/2026/02/18/mario-agentic-ai/}
}