<?xml version="1.0" encoding="UTF-8"?>

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">

  <channel>

    <title>Han, Not Solo</title>

    <description>Han Lee&apos;s blog on machine learning engineering, compound AI systems, search and information retrieval, and recsys — exploring machine learning, LLM agents, and data science insights from startups to enterprises.
</description>

    <link>https://leehanchung.github.io</link>

    <atom:link href="https://leehanchung.github.io/feed.xml" rel="self" type="application/rss+xml" />

    

      <item>

        <title>&quot;Determinism&quot; is the Biggest Cope in AI Adoption</title>

        <description>

          

          We’ve never had determinism in software. We just had the illusion of it.

Here’s a fact that most people outside computer science don’t know: in 1936, Alan Turing proved that there is no way to build a program that can check whether another arbitrary program will even finish running. This is the Halting Problem. Years later, Rice’s theorem took this further — Henry Gordon Rice proved that it is mathematically impossible to build a tool that can verify any non-trivial property of what a program does, in the general case. Not hard. Not expensive. Impossible.


  This means “make sure it doesn’t make a mistake” software was never a guarantee anyone could offer. Every piece of software you trust today shipped with that same uncertainty.


So when someone says “I can’t use LLMs in production because they’re nondeterministic — we need to build deterministic workflows where making no mistakes is the baseline expectation,” they’re confusing repeatability with correctness. A deterministic program that returns the wrong answer returns it every single time. That’s a bug, not a baseline expectation.

Manufacturing figured this out decades ago. Six Sigma doesn’t demand zero defects — it defines an acceptable defect rate and builds measurement systems to stay within that bound. The discipline was never “eliminate all variation.” It was “define, measure, analyze, improve, control” — continuously reducing variation. That’s evaluation, not determinism.

What AI systems shift is the evaluation surface. Instead of “does this code path execute as specified,” you ask “does this output meet our evaluation criteria.” The work moves from pre-deployment code verification to continuous evaluation.
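
Here is a minimal sketch of what that shift looks like in code. It assumes a fixed eval set, a run_model callable, and a task-specific check function; the names and the Six Sigma style defects-per-million framing are illustrative, not a prescribed stack.

def evaluate(examples, run_model, check):
    """Run the model over a fixed eval set and report a defect rate, Six Sigma style."""
    results = [check(example, run_model(example)) for example in examples]
    pass_rate = sum(results) / len(results)
    dpmo = (1.0 - pass_rate) * 1_000_000   # defects per million opportunities
    return {"pass_rate": pass_rate, "dpmo": dpmo}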


  In AI and machine learning systems, we reduce entropy (chaos) through evaluation.


This is not new. TCP connects on unreliable networks. RAID arrays operate on top of failing drives. AI models are trained on failing GPUs. We’ve always built reliable systems from unreliable components.


  It was never about determinism. It was always about evaluations. If this resonates, I go deep on evaluation design in my book.



        </description>

        <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2026/04/07/determinism-biggest-cope-in-ai-adoption/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2026/04/07/determinism-biggest-cope-in-ai-adoption/</guid>

      </item>

    

      <item>

        <title>The AI Great Leap Forward</title>

        <description>

          Backyard furnaces, fake grain reports, dead sparrows, and poisoned flowers — your company&apos;s AI transformation is repeating history. - 

          In 1958, Mao ordered every village in China to produce steel. Farmers melted down their cooking pots in backyard furnaces and reported spectacular numbers. The steel was useless. The crops rotted. Thirty million people starved.

In 2026, every other company is issuing top-down mandates for AI transformation.

Same energy.



Backyard Furnaces

The rallying cry of the Great Leap Forward was 超英趕美 — surpass England, catch up to America. Every province, every village, every household was expected to close the gap with industrialized Western nations by sheer force of will. Peasants who had never seen a factory were handed quotas for steel production. If enough people smelted enough iron, China would become an industrial power overnight. Expertise was irrelevant. Conviction was sufficient.

The mandate today is identical, just swap the nouns. Every company, every function, every individual contributor is expected to close the AI gap. Ship AI features. Build agents. Automate workflows. That nobody on the team has ever trained a model, designed an evaluation system, or debugged a retrieval system is beside the point. Conviction is sufficient.

So everyone builds. PMs build AI dashboards. Marketing builds AI content generators. Sales ops builds AI lead scorers. Software engineers are building AI and data solutions that look pixel-perfect and function terribly. The UI is clean. The API is RESTful. The architecture diagram is beautiful. The outputs are wrong. Nobody checks because nobody on the team knows what correct outputs look like. They’ve never looked at the data. They’ve never computed a baseline.


  


Entire departments are stitching together n8n workflows and calling it AI — dozens of automated chains firing prompts into models, zero evaluation on any of them. These tools are merchants of complexity: they sell visual simplicity while generating spaghetti underneath. A drag-and-drop canvas makes it trivially easy to chain ten LLM calls together and impossibly hard to debug why the eighth one hallucinates on Tuesdays. The people building these workflows have never designed an evaluation pipeline, never measured model drift, never A/B tested a prompt. They don’t need to — the canvas looks clean, the arrows point forward, the green checkmarks fire. The complexity isn’t avoided. It’s hidden behind a GUI where nobody with ML expertise will ever look.

The backyard steel of 1958 looked like steel. It was not steel. Today’s backyard AI looks like AI. It is not AI. A TypeScript workflow with hardcoded if-else branches is not an agent. A prompt template behind a REST endpoint is not a model. Calling these things AI is like calling pig iron from a backyard furnace high-grade steel. It satisfies the reporting requirement. It fails every real-world test.

But the most dangerous furnace is the one that produces something functional. Teams are building demoware — pretty interfaces, working endpoints, impressive walkthroughs — with zero validation underneath. Some are in-housing SaaS products by vibe coding some frontend with coding agents: it runs, it has a dashboard, it cost a fraction of the vendor. Klarna announced in 2024 that it would replace Salesforce and other SaaS providers with internal AI-built solutions. What these replacements don’t have is data infrastructure, error handling, monitoring, on-call support, security patching, or anyone who will maintain them after the builder gets promoted and moves on.

These apps will win awards at the next all-hands. In two years they’ll be unmaintainable tech debt some poor soul inherits and rewrites from scratch. The furnace produced pig iron. Someone stamped “steel” on it. Now it’s load-bearing.

Meanwhile, the actual product that customers pay for rots in the field. But hey, 超英趕美. The AI adoption dashboard is green.

Reporting Grain Production to the Central Committee

During the Great Leap Forward, provinces competed to report the most spectacular grain yields. Hubei reported 10,000 jin per mu. Guangdong said 50,000. Some counties claimed over 100,000 — physically impossible numbers, rice plants supposedly so dense that children could stand on top of them. Officials staged photographs. Everyone knew the numbers were fake. Everyone reported them anyway, because the alternative was being labeled a saboteur. The central government, delighted by the bounty, increased grain requisitions based on the reported yields. Farmers starved eating the difference between the real number and the fantasy.

You’ve seen this meeting.

One team reports their AI copilot “reduced development time by 40%.” The next team, not to be outdone, reports 60%. A third claims their AI agent “automated 80% of analyst workflows.” Nobody asks how these were measured. Nobody checks the methodology. Nobody points out that the team claiming 80% automation still has the same headcount doing the same work. The numbers go into a slide deck. The slide deck goes to the board. The board is delighted. The board increases investment.


  


Then someone — there’s always someone — builds a leaderboard tracking how many prompts you wrote this week, how much of your code is AI-generated, your ranking versus your team, versus your org, versus the entire company. One day your company announces: stop everything, it’s AI Week. Build something with AI. Show what you’ve got. You think you’re done after the hackathon? No no no. Now you have to promote it. Daily posts: look what I built, here’s how many agents I used, here’s how many skills I shipped. Pull in teammates. Pull in strangers. Ask for feedback. “Humbly.”

Your AI usage is now a KPI. You are being evaluated on how much grain you reported, not how much grain you grew. This is Goodhart’s Law at organizational scale: when a measure becomes a target, it ceases to be a good measure. The metric was supposed to track whether AI is making the company better. Instead, the entire company is now optimizing to make the metric look better. The beatings will continue until adoption improves.

Killing the Sparrows

The Great Leap Forward’s most tragicomic chapter was the 除四害运动 (Eliminate Four Pests Campaign). Mao declared sparrows an enemy of the state — they ate grain seeds, so killing them would increase harvests. The entire country mobilized. Citizens banged pots and pans to keep sparrows airborne until they dropped dead from exhaustion. Children climbed trees to smash nests. Villages competed for the highest kill count. It worked. They nearly eradicated sparrows.

Then the locusts came.

Sparrows ate locusts. Without sparrows, locust populations exploded. The swarms devoured far more grain than the sparrows ever did. The campaign to save the harvest destroyed it. Mao quietly replaced sparrows with bedbugs on the official pest list and never spoke of it again.

Every AI Great Leap Forward has its sparrow campaign.

Middle managers are the sparrows. They’re declared pests — too many layers, too slow, too expensive. Flatten the org! Move faster! Let AI handle coordination! So companies eliminate M1s, turn managers into tech leads running pods, and let the teams self-organize with AI tools.


  


Then the locusts come. Those middle managers held institutional knowledge — which customer had the weird integration, why the data model had that inexplicable column, the undocumented business rule that kept compliance from flagging every third transaction. That context lived in their heads. Now they’re gone, and the AI system they were replaced with needs exactly that context to function.

QA is a sparrow too. “AI writes the tests now.” So you cut QA. The AI writes tests that validate its own assumptions — a machine checking its own homework. Senior engineers who mentored juniors? Sparrows. Documentation writers? Sparrows. The ops team that knew how to restart the weird legacy service at 2 AM? Definitely sparrows.

Each elimination looks rational in isolation. The second-order effects arrive six months later, and by then nobody connects the locust swarm to the dead sparrows.

Let a Hundred Skills Bloom

In 1956, Mao launched the 百花运动 (Hundred Flowers Campaign): “Let a hundred flowers bloom, let a hundred schools of thought contend.” Speak freely. Share your honest criticisms. The Party wants to hear your real thoughts.

Intellectuals took the bait. They spoke openly.

Then came the 反右运动 (Anti-Rightist Campaign). Everyone who had spoken honestly was identified, labeled, and purged. The Hundred Flowers was a trap — an efficient mechanism for surfacing exactly who knew what, then eliminating them. The lesson every survivor internalized: never honestly reveal what you know, because it will be used against you.

Now Meta and a growing list of companies have launched their own Hundred Flowers. The mandate: every employee must build “agent skills” — distill your subject matter expertise into structured prompts and workflows that AI agents can execute. Or even worse, build “agents” using drag-and-drop legacy tech that never worked and had already been abandoned by the leading-edge labs back in 2024. Encode your judgment. Document your decision-making. Make yourself legible to the machine.


  


The stated goal is distilling your subject matter expertise. Turn the expert’s craft into the organization’s asset. What leadership actually wants is to convert individual human capital into organizational capital that survives any single employee’s departure.

Employees see the game immediately. If I distill my ten years of domain expertise into a skill that any junior can invoke with a prompt, I have just automated my own replacement. The knowledge that makes me the critical node — the person they call at 2 AM, the one who knows why the model does that weird thing for Brazilian entities — is my moat. You’re asking me to drain it.

So they adapt to build anti-distillation agent skills, just as the intellectuals adapted after the Anti-Rightist trap.

We are already seeing agent skills built specifically for job security. The performative skill looks comprehensive and demos well but omits the 20% of edge-case knowledge that makes it work in production — you are now more indispensable, not less. The poison pill encodes expertise faithfully but with subtle dependencies on context only you hold — internal wikis you maintain, terminology you coined, data pipelines you own — so removing you causes outputs to drift quietly until someone says “we need to bring them back on this.” The complexity moat makes the skill so architecturally entangled with your other work that extracting your knowledge is harder than keeping you around. You are now a load-bearing wall disguised as a decoration.

The campaign designed to reduce organizational dependence on individual experts has now created experts who are strategically indispensable — not because of what they know, but because of how they’ve booby-trapped the system to need them. The flowers bloomed. They’re full of thorns.

Meanwhile, the “everyone builds with AI” mandate has turned into a hunger game of scope creep. Engineers use AI to generate designs and ship prototypes without waiting for the design team. PMs use AI to write code and spin up dashboards without filing engineering tickets. Designers use AI to build product specs and run user research without looping in product. Everyone is expanding into everyone else’s territory — not because they’re better at it, but because AI makes it possible and the mandate makes it rewarded. The org chart says collaboration; the incentive structure says land grab. What looks like productivity gains is actually a war of all against all, where every function is simultaneously trying to prove it can absorb the others before the others absorb it.


  


The Famine Comes Later

The Great Leap Forward’s famine didn’t arrive immediately. For a while, the numbers looked spectacular. Every province reported record harvests. Leadership was pleased. The requisitions increased.

The famine came when the real grain ran out but the reported grain kept flowing upward.

We’re still in the reporting phase. The dashboards are green. Adoption is up and to the right. Every team reports productivity gains that, if summed across the company, would imply engineers are shipping at 300% efficiency while somehow still missing the same deadlines.

Underneath the metrics, it’s a race to the bottom. One person builds a skill, so someone else builds a better one. One person demos a prototype, so someone else benchmarks it. Everyone competing to prove, more thoroughly than the next person, that their own role is replaceable. All accelerating. All sinking.

The sparrows are dead. The locusts haven’t arrived yet. The flowers bloomed full of poison pills. The furnaces produced pig iron stamped as steel that’s now load-bearing. The grain numbers look fantastic.

But it’s fine. We’re surpassing and catching up.

Oh, and Klarna? The company that loudly announced it would replace Salesforce with internal AI solutions? They quietly replaced Salesforce with another SaaS vendor instead. The backyard furnace couldn’t produce real steel. They bought it from a different mill.

The question nobody’s asking: what did any of this actually produce?

The answer, when it arrives, will be awkward.

References


  Kafka, P. (2026). Meta’s AI week shows how every company is pushing employees to use AI. Business Insider. https://www.businessinsider.com/meta-ai-week-employee-training-claude-agents-vibe-coding-2026-3
  leilei926524-tech. (2026). anti-distill. GitHub. https://github.com/leilei926524-tech/anti-distill
  Blum, S. (2024). Klarna Plans to Shut Down SaaS Providers and Replace Them With AI. Inc. https://www.inc.com/sam-blum/klarna-plans-to-shut-down-saas-providers-and-replace-them-with-ai.html
  CX Today. (2025). Klarna Didn’t Replace Salesforce — It Replaced Them With Alternative SaaS Apps. https://www.cxtoday.com/crm/klarna-didnt-replace-salesforce-it-replaced-them-with-alternative-saas-apps/


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {The AI Great Leap Forward},
    year = {2026},
    month = {04},
    day = {05},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2026/04/05/the-ai-great-leap-forward/}
}



        </description>

        <pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2026/04/05/the-ai-great-leap-forward/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2026/04/05/the-ai-great-leap-forward/</guid>

      </item>

    

      <item>

        <title>A Taxonomy of RL Environments for LLM Agents</title>

        <description>

          The infrastructure that determines what your agent can actually learn - 

          Model architecture gets all the attention. Post-training recipes follow close behind. The reinforcement learning (RL) environment — what the model actually practices on, how its work gets judged, what tools it can use — barely enters the conversation. That’s the part that actually determines what the agent can learn to do.

A model trained only on single-turn Q&amp;amp;A will struggle the moment you ask it to maintain state across a 50-step enterprise workflow. A model trained with a poorly designed reward function will learn to game the metric rather than solve the problem. The reinforcement learning environment is half the system.

The Canonical Loop

Recall that reinforcement learning is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take action in a dynamic environment in order to maximize a reward signal. It involves a set of agent and environment states $S$, a set of actions (action space) available for the agent $A$, and the immediate reward $R_t$ after transition from $S_t$ to $S_{t+1}$ under action $A_t$.



If we take this model into the world of AI agents under the assumption of enabling training of agentic models, we can mutate the framework as follows. An RL environment for an LLM agent bundles the following objects: a dataset of task inputs, a harness for the model, a reward function to score outputs, the state of the environment, and configurations of the environment. Note that we specifically bundle tasks with the environments as tasks are most often environment dependent. As an example, a coding task is bundled with a coding environment, not with a research environment. With this framing, the training loop looks like this:



Formally, a complete RL environment is a set:

\[E = \{T, H, V, S, C\}\]

where

$T$ = tasks

$H$ = agent harness

$V$ = verifier

$S$ = state management

$C$ = configuration
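
As a rough sketch, the bundle might be represented in code like this; the field names and types are illustrative rather than taken from any particular framework:

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class RLEnvironment:
    tasks: list                                  # T: task inputs bundled with this environment
    harness: dict                                # H: rollout protocol, tools, prompts, limits
    verifier: Callable                           # V: maps (prompt, completion, info) to a [0, 1] reward
    state: dict = field(default_factory=dict)    # S: persistent state across turns or episodes
    config: dict = field(default_factory=dict)   # C: turn limits, context budget, curriculum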

Let’s go through each of the components.

$T$: Tasks

Tasks are the set of problems the agent tries to solve within its environment. Not all tasks are equal, and not just in difficulty. They vary structurally in ways that demand different capabilities: the number of actions an agent needs to take to complete the task, the number of distinct tools in the environment it needs to use, the number of tokens consumed, and the amount of time the task takes to complete. These can be captured in various distributions such as:


  
    
      Task Type
      What the Agent Must Do
      Example Systems
    
  
  
    
      Single-turn Q&amp;amp;A
      One prompt → one response, check answer
      Math benchmarks, SimpleQA
    
    
      Multi-hop search
      Chain searches, synthesize sources
      BrowseComp, WebWalkerQA
    
    
      Open-ended research
      No single correct answer; report quality matters
      ADR-Bench, ResearchRubrics
    
    
      Agentic tool-use
      Call tools correctly in sequence
      tau-bench, function-calling benchmarks
    
    
      Stateful enterprise
      Modify persistent DB state, work within access controls
      EnterpriseOps-Gym
    
    
      Code generation
      Write code, run it, check outputs
      SWE-Bench, LiveCodeBench
    
    
      Code review &amp;amp; repair
      Detect bugs, suggest fixes, verify patches
      CodeReview-Bench, DebugBench
    
    
      Repository-level coding
      Navigate large codebases, multi-file edits, resolve issues
      SWE-Bench Verified, RepoBench
    
    
      Productivity workflows
      Draft emails, manage calendars, triage notifications
      WorkArena, OSWorld
    
    
      Document authoring
      Create, edit, or summarize documents across apps
      BrowserGym, GAIA
    
  



  In RL, the sequence of states, actions, and rewards that an agent produces while solving a task is called a trajectory. A single run from start to completion is an episode, and the process of executing a policy to generate a trajectory is called a rollout. In the agent world, a logged record of an agent’s execution — including tool calls, observations, and intermediate outputs — is called a trace. A trajectory is what the trainer sees (state-action-reward tuples); a trace is what the observability system sees (structured execution logs).


Designing the set of tasks with a proper distribution is an important data design decision. Agentic models need to be able to explore the environment to learn. This means that an agent trained only in clean, deterministic environments will most likely not know how to respond in more stochastic production environments. Or an agent will not be able to learn if there’s always a positive reward; it simply has no way to distinguish good actions from bad ones.

The cheapest tasks to collect are single-turn tasks with verifiable answers. The most valuable tasks for long-horizon behavior are expensive to construct. This tension drives most environment design decisions. In addition, we can construct a curriculum of tasks based on difficulty. Similar to how humans learn math in progressive difficulty, e.g., from 9th grade algebra to 12th grade calculus, we can order tasks by difficulty and increase complexity during training.

Synthetic data for tasks is increasingly a first-class problem. With real-world productivity and research tasks, you rarely have a large labeled dataset. Strategies for generating synthetic tasks include:


  Back translation: Start from a desired output, reconstruct the task input that would produce it (see the sketch after this list)
  Graph-based synthesis: Build a knowledge graph, generate multi-hop queries over it
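
As a rough illustration of back translation, the sketch below assumes a generic llm() helper (any chat-completion client would do) and produces a pair whose answer can be verified by exact match:

def back_translate(answer, llm):
    """Start from a desired output and ask a model to reconstruct the task input."""
    prompt = (
        "Write a question whose correct answer is exactly the text below. "
        "Return only the question.\n\n" + answer
    )
    question = llm(prompt)
    return {"input": question, "target": answer}   # exact-match verifiable pair

In practice, the generated pairs usually need a filtering pass, for example checking that a model can actually recover the target from the reconstructed question, before they are trusted as training tasks.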


$H$: Agent Harness

The harness is the scaffolding that enables the model to interact with the environment. This controls how the model interacts, but it does not improve what it knows.

We can define harness as follows:

H = {
    rollout_protocol,   # SingleTurn | MultiTurn | Agentic
    tools,              # Available tools for a rollout in an environment
    system_prompt,      # Instructions for the agent
    context_manager,    # How to handle context overflow
    turn_limit,         # Max interactions for a rollout in an environment
    sandbox,            # Code execution sandbox
    state               # Persistent state across turns
}


Rollout protocols range from trivial to complex:


  
    
      Harness Type
      Description
      When to Use
    
  
  
    
      Single-Turn
      One prompt, one response
      Math, factual QA
    
    
      Multi-Turn
      Back-and-forth dialogue
      Games, structured tasks
    
    
      Tool-Use
      Model calls tools, receives results
      Agent benchmarks
    
    
      Stateful Tool-Use
      Tools modify persistent state
      Enterprise workflows, SWE-Bench
    
    
      Agentic
      Full Observation→Orient→Decide→Act (OODA) loop
      Deep research, complex workflows
    
  


Tools span a wide taxonomy:


  
    
      Category
      Tools
      Deterministic?
      Stateful?
    
  
  
    
      Information retrieval
      web_search, scholar_search
      No (live web)
      No
    
    
      Content extraction
      jina_reader, visit, web_scrape
      No
      No
    
    
      Code execution
      python_interpreter, shell, sandbox
      Yes (given same code)
      Yes
    
    
      File operations
      file_read, file_write
      Yes
      Yes
    
    
      Browser automation
      playwright, link_click
      No
      Yes
    
    
      Task management
      todo, section_write
      Yes
      Yes
    
  


The mix of deterministic/non-deterministic and stateful/stateless tools impacts reproducibility and reward assignment. Non-deterministic tools mean two runs of the same trajectory can produce different outcomes — which complicates both debugging and verifier design.

Note that modern agent harness designs reduce the number of tools down to atomic basics: often read, write, edit, bash, a task tool that kicks off a subprocess for subagents, mcp for connecting to MCP resources, and skill and askUserQuestions for managing agent skills and human-agent interfaces (HAI). This is distinctly different from the early days of LLM-based AI agents, where we manually added individual tools such as API calls or database connections.

Context management is critical for long-horizon tasks. The role of the harness here is analogous to an operating system: just as an OS abstracts away memory and process management so applications don’t have to, the agent harness manages context so that agent skills and users don’t need to. A 600-turn research episode blows past any practical context window. Strategies used in production:


  
    
      Strategy
      Description
      Trade-off
    
  
  
    
      Recency-based retention
      Keep N most recent turns
      Simple, but loses early context
    
    
      Markovian reconstruction
      Reconstruct state from scratch each turn
      Principled, expensive
    
    
      Reference-preserving summarization
      Summarize old context, keep citations
      Preserves verifiability
    
    
      Reference-preserving folding
      Compress context without losing references
      Best for research tasks
    
  


An agent doing multi-hour research needs to remember why it started searching in a particular direction twelve tool calls ago. Dropping that context causes repeated work and lost threads.
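
A rough sketch combining the two simplest strategies, recency-based retention plus a reference-preserving summary of older turns, might look like this; the summarize callback is an assumption, typically another LLM call instructed to keep citations intact:

def fold_context(turns, keep_recent=20, summarize=None):
    """Keep the most recent turns verbatim; compress older turns into one summary
    message so that citations and key decisions survive the context budget."""
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    if older and summarize is not None:
        recent = [{"role": "system", "content": summarize(older)}] + recent
    return recent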

$V$: Verifier

The verifier maps a completion to a reward:

\[V: (\text{task prompt}, \text{completion}, \text{info}) \rightarrow [0, 1]\]

In Atari, the score is unambiguous. In coding, verification is straightforward when tests pass, but gets murkier — what about code that is correct but poorly styled or computationally expensive? In deep research, what counts as a good answer is far more ambiguous. This is the generation-verification gap: generating outputs with AI agents is cheap, but verifying their quality becomes progressively harder as tasks grow more open-ended. The goal of the verifier is to map a large, stochastic space of inputs and outcomes into a narrow reward signal, typically between 0 and 1. Designing this mapping is a core challenge in building RL environments.


  
    
      Type
      Reward Signal
      When to Use
    
  
  
    
      Exact match
      Binary (0/1)
      Ground truth available
    
    
      Code execution
      Binary or partial
      Output can be tested programmatically
    
    
      LLM-as-judge
      Continuous [0,1]
      Open-ended quality, no other option
    
    
      Checklist-style
      Continuous
      Multi-criteria research tasks
    
    
      Evolving rubric (RLER)
      Continuous
      Resistant to reward hacking
    
    
      Process reward model (PRM)
      Per-N-step continuous
      Long-horizon credit assignment
    
    
      Pairwise comparison
      Relative rank
      Relative quality matters more than absolute
    
    
      Multi-criteria composite
      Weighted sum
      Multiple quality dimensions
    
  


A few principles that actually matter in practice:

Verifiable beats judgeable. Programmatic checks, such as string match or code execution, are faster, cheaper, and more consistent than LLM-as-judge. Use LLM-as-judge when there’s no other option, not as the default.
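
A minimal verifier of this kind, assuming the gold answer is carried in the info dict, is only a few lines:

def exact_match_verifier(prompt, completion, info):
    """Maps (task prompt, completion, info) to [0, 1], here as a binary string match."""
    predicted = completion.strip().lower()
    gold = info["answer"].strip().lower()
    return 1.0 if predicted == gold else 0.0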

Reward granularity is a separate decision from reward type. You can score at the trajectory level (did the final output pass?), turn level (was each tool invocation useful?), or per-step with process rewards. Turn-level supervision, as Nanbeige4.1 does across up to 600 tool calls, enables finer credit assignment — the model can learn that the problem was a bad search query in turn 23, not that the entire episode failed. Think of it like project management; we only need to check if the lightbulb is lit if we are changing a lightbulb, but we will need regular inspections and milestones if we are doing a full kitchen remodeling.

Static rubrics get gamed. Models learn to write answers that score well on your rubric rather than solving the problem. DR Tulu’s RLER (Rubric-Level Evolving Reward) co-evolves the rubric with the policy during training. Harder to exploit a moving target.

Noise injection is underrated. Step-DeepResearch (Hu et al., 2025) deliberately injects 5–10% tool errors during training. The resulting model handles flaky APIs and unexpected failures in production significantly better.
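
A hedged sketch of the idea: wrap each tool so that a small fraction of calls return an error, forcing the policy to learn retries and fallbacks. The error payload shape is illustrative.

import random

def make_flaky(tool, error_rate=0.07, rng=random.Random(0)):
    """Wrap a tool so roughly error_rate of calls fail, mimicking flaky APIs."""
    def wrapped(*args, **kwargs):
        failed = rng.choices([True, False], weights=[error_rate, 1 - error_rate])[0]
        if failed:
            return {"error": "tool temporarily unavailable"}   # injected failure
        return tool(*args, **kwargs)
    return wrapped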

$S$: State and $C$: Configuration

Every agent needs an environment to act in, and environments vary widely. A Pokémon Ruby agent plays the game itself, with all its controls and mechanics. A coding agent typically operates inside a virtual machine with code repositories and instructions such as AGENTS.md that guide the agent; it can also execute code in the VM to verify correctness. A deep research agent uses a VM as a scratch pad with access to the internet or knowledge bases to produce a comprehensive research report.

Some environments are stateless — each episode starts fresh with no memory of prior runs. A coding agent solving LeetCode problems needs no persistent state. But some environments are stateful: a coding agent that must manipulate a database carries state across actions, and an enterprise agent carries state across episodes. EnterpriseOps-Gym (Zhang et al., 2026) maintains 164 database tables and 512 tools across episodes, where actions in one task affect the state seen by subsequent tasks. That’s a fundamentally different problem for agents to learn.

Automated environment generation is an emerging approach to scaling environment diversity. Rather than hand-authoring environments, LLM coding agents write new environment code. AutoEnv (Wang et al., 2025) reports ~$4/env average cost.

Configuration covers turn limits, context budgets, sampling temperature, and curriculum scheduling. These are not afterthoughts — a turn limit of 5 vs. 600 changes what skills the agent can develop. AgentScaler (Pan et al., 2025) uses a two-phase curriculum — fundamental capabilities first, then domain-specific tasks — and the ordering matters. Step-DeepResearch progressively scales context windows from 32K to 128K during mid-training.

Deployment topology. In practice, the trainer, model inference server, and environment typically run as separate processes communicating via API — as shown in the canonical loop diagram. This split lets you scale inference and environment execution independently and swap models without rewriting environment code.

Benchmarks: Frozen Environments

If you’ve built benchmarks before, you’ve already built an RL environment — just a frozen one. Press (2026) defines a benchmark as a 4-tuple:

\[B = (\text{Request}, \text{Environment}, \text{Stopping Criteria}, \text{Scorer})\]


  Request is the task prompt, which maps to $\textbf{T}$ (tasks) in our RL environment.
  Environment is the sandbox the model operates in, including tools, APIs, and file systems. This is a subset of the RL environment, covering only $\textbf{H}$ (harness) and $\textbf{S}$ (state).
  Stopping criteria define when an episode ends — turn limits, timeouts, or the model declaring it’s done. This is the $\textbf{C}$ (configuration) part of the RL environment.
  Scorer maps the model’s output to a grade, which is the $\textbf{V}$ (verifier) in the RL environment.


The difference is that a benchmark freezes every component to enable reproducibility across runs.

Because benchmarks and training environments share the same components, the design principles that make benchmarks good apply directly to training environments — with one key difference: training environments can evolve their parameters over the course of a run.

Task naturalness. SWE-bench (Jimenez et al., 2024) works because its tasks are real GitHub issues filed by real developers — not synthetic problems invented by researchers. Press (2026) argues that a useful benchmark should contain tasks that actual humans perform frequently and that a system scoring well on them would save someone real time. The same applies to training: an agent trained on tasks no human would actually encounter may ace your eval without learning to be useful. When generating tasks at scale, naturalness separates curriculum from noise.

Automatic, verifiable scoring. If a benchmark requires human judges, it can’t scale. If a training environment requires human judges, it can’t train. The principle is identical but the stakes are higher — training runs may need millions of reward signals, not hundreds. This is why the “verifiable beats judgeable” principle from the verifier section matters even more at training time.

Difficulty calibration. Press recommends launching benchmarks with top-model accuracy between 0.1% and 9%. The training analog: if your task distribution is too easy, the agent quickly hits a ceiling and stops improving. If it’s too hard, the reward signal is too sparse to learn from. The sweet spot shifts as the model improves, which is why training environments benefit from curriculum scheduling, something a frozen benchmark cannot do. That’s the extra degree of freedom.
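
One way to express that extra degree of freedom, sketched with illustrative thresholds: sample training tasks from the band where the model currently succeeds sometimes but not always, and let the band shift as pass rates move.

import random

def in_learnable_band(pass_rate, low=0.10, high=0.60):
    """True when pass_rate lies inside [low, high]; clamping leaves such values unchanged."""
    return min(max(pass_rate, low), high) == pass_rate

def sample_curriculum(tasks, pass_rates, k=32, rng=random):
    """Prefer tasks that are neither saturated (too easy) nor reward-starved (too hard)."""
    band = [task for task, rate in zip(tasks, pass_rates) if in_learnable_band(rate)]
    pool = band if band else list(tasks)
    return rng.sample(pool, min(k, len(pool)))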

Scorer independence. Using the same model family to both generate completions and judge them creates a feedback loop — the agent learns to write prose that sounds good to its own judge rather than prose that’s correct. In benchmarks, this inflates scores. In training, it’s worse: it actively teaches the wrong behavior. If you must use LLM-as-judge, the judge should be a different model class than the policy, and ideally one the training signal can’t update.

The difference between a benchmark and a training environment is that benchmarks freeze; training environments evolve. Task distributions shift via curriculum. Verifier rubrics co-evolve with the policy (RLER). Configuration parameters scale up over training. But the underlying components — and the principles that make them good or bad — are the same.

Additional Considerations

Environment diversity matters as much as environment quality. AgentScaler’s key finding is that heterogeneity of environments drives capability breadth in ways that simply adding more data from the same distribution cannot. You need more kinds of environments, not just more environments.

Automated environment generation is viable. At $4 per generated environment, cost is no longer the bottleneck. The bottleneck is verifier quality — auto-generated environments with weak reward functions will teach the wrong behaviors at scale. (AutoEnv)

The environment-as-package model is winning — and becoming a managed service. The Prime Intellect Environments Hub created a shared ecosystem around RL environments, in the same way PyPI and HuggingFace created ecosystems around code and model weights. OpenReward (General Reasoning, 2026) pushes this further by serving 330+ RL environments as managed API endpoints backed by 4.5M+ tasks and autoscaled sandbox compute. The underlying protocol — the Open Reward Standard (ORS) — extends MCP (Anthropic, 2024) with RL primitives: episodes, reward signals, task splits, and curriculum management. ORS is to RL environments what MCP is to tool integration: a shared interface that decouples the environment from the trainer. Environments published once, consumed by any trainer, hosted or self-served.

Contamination resistance will become a design requirement. As RL environments are reused across labs and open-source efforts, data contamination — models memorizing benchmark answers from pre-training — becomes a real threat to training signal validity. Environments that support held-out task splits, dynamic task generation, or verifier-side answer withholding will age better than static datasets. SciCode (Tian et al., 2024) demonstrates this with multi-step scientific problems designed to resist memorization through compositional subproblem structure.

Conclusion

RL environments are the training grounds that shape what agents can do. The task distribution determines what skills the agent develops. The harness controls how it interacts. The verifier defines what “good” means. The state and configuration determine how realistic the training is. Get these right, and the agent learns behaviors that transfer to production. Get them wrong, and you’ve trained an expensive demo.

References


  Sutton, R. S., &amp;amp; Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press. http://www.incompleteideas.net/book/the-book-2nd.html
  Lee, H. (2026). It’s-a Me, Agentic AI. Han, Not Solo. https://leehanchung.github.io/blogs/2026/02/18/mario-agentic-ai/
  Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., &amp;amp; Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv. https://arxiv.org/abs/2310.06770
  Tian, M., et al. (2024). SciCode: A Research Coding Benchmark Curated by Scientists. arXiv. https://arxiv.org/abs/2407.13168
  Anthropic. (2024). Model Context Protocol. https://modelcontextprotocol.io/
  Pan, J., et al. (2025). AgentScaler: Scaling LLM Agent Training with Automatically Constructed Environments. arXiv. https://arxiv.org/abs/2509.13311
  Wang, Y., et al. (2025). AutoEnv: Towards Automated Reinforcement Learning Environment Design. arXiv. https://arxiv.org/abs/2511.19304
  PrimeIntellect. (2025). Prime RL Environments Hub. GitHub. https://github.com/PrimeIntellect-ai/prime-rl
  Press, O. (2026). How to Build Good Language Modeling Benchmarks. https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/
  Zhang, K., et al. (2026). EnterpriseOps-Gym: A Benchmark for Enterprise Operations Agents. arXiv. https://arxiv.org/abs/2603.13594
  General Reasoning. (2026). OpenReward: Managed RL Environments API. https://docs.openreward.ai/
  Open Reward Standard. (2026). ORS Protocol Specification. https://openrewardstandard.io/


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {The Training Grounds: A Taxonomy of RL Environments for LLM Agents},
    year = {2026},
    month = {03},
    day = {21},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/}
}



        </description>

        <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/</guid>

      </item>

    

      <item>

        <title>It&apos;s-a Me, Agentic AI</title>

        <description>

          Understanding agentic model development and agent frameworks through the lens of Super Mario - 

Agentic AI is a fairly recent development that combines reasoning (OpenAI, 2024) and tool use (Schick et al., 2023) in the same AI model. But an agentic AI system is not just the model; it is also the harness, environments, tools, rewards, evaluations and benchmarks, and all of the infrastructure to support it. In this post, let’s use Super Mario, the classic Nintendo video game, to tell the story of how agentic AI models are developed, how agent harnesses work, and how reinforcement learning ties everything together. If you survived World 8-4 as a kid, you already have the intuition for building agentic AI systems.


Small Mario: The Base Model



Small Mario is the base pretrained model. He’s just come out of pretraining on a massive corpus of platform game physics. He can walk, jump, and move left and right. These are his base capabilities, the raw knowledge compressed from the training data.

But Small Mario is fragile. One hit from a Goomba and he’s dead. He can’t break bricks or take damage. He has potential, but he’s not yet useful for anything beyond the most trivial tasks.

This is your base LLM fresh off pretraining. It has absorbed enormous amounts of knowledge, it can do next-token prediction, and it can sort-of follow instructions. But ask it to do anything real, reliably, in production, and it falls apart on the first obstacle. One Goomba and it’s game over.



The Super Mushroom is the Agent Harness



Then Mario finds the Super Mushroom. He doubles in size. He can now break bricks. He can take a hit without dying. He goes from fragile to capable.

The Super Mushroom is the model harness. Once eaten, it transforms a base model into something production-ready. This includes:


  System prompts that define personality and constraints
  Safety guardrails so it can take some damage without dying
  Memory and context management so it remembers where it’s been
  Tool-use training so it knows power-ups exist and how to grab them


Without the Super Mushroom, Mario is a liability. With it, he now has potential for greatness. Similarly, without the model harness, a base LLM is a research artifact. With it, it has the potential to become a product.


  The Super Mushroom doesn’t change WHO Mario is. It changes what he can SURVIVE. The model harness doesn’t change the model’s core knowledge. It changes what the model can handle in production.




Power-Ups are Agent Skills



Now here’s where it gets interesting. Super Mario can pick up power-ups that give him entirely new capabilities. These are agent skills:


  
    
      Power-Up
      Mario Ability
      Agent Equivalent
    
  
  
    
      Fire Flower
      Throw fireballs at enemies
      Code execution — solve problems the model can’t solve with text alone
    
    
      Frog Suit
      Swim through water levels
      Web search — navigate environments the model wasn’t trained on
    
    
      Star
      Temporary invincibility
      Extended thinking — brute force through complex problems at higher compute cost
    
    
      Cape Feather
      Sustained flight
      MCP servers — extensible access to external services and APIs
    
  


Each power-up doesn’t replace Mario’s core abilities. Mario still walks and jumps. The power-ups extend what he can do. A Fire Flower Mario can still jump on Goombas, but now he can also shoot fireballs at Piranha Plants hiding in pipes.

This is exactly how agent skills and tools work. The LLM still does what LLMs do: reasoning, language understanding, and planning. Tools extend the model’s reach into environments it can’t operate in alone. An LLM can’t execute Python by itself, just like Mario can’t throw fireballs without a Fire Flower. But give it the right tool, and suddenly the problem space opens up.

And critically, Mario has to learn WHEN to use each power-up. Frog Suit is amazing in water levels, useless on land. Fire Flower is great against Goombas, pointless against Thwomps. The model needs to learn tool selection, knowing which tool to reach for in which context. This is one of the hardest parts of building agentic systems.



One power-up deserves special attention: the Star. When Mario grabs a Star, he becomes invincible. He plows through Goombas, Koopa Troopas, Piranha Plants, everything in his path just disintegrates. Nothing can stop him.

This is like having an engineering manager who’s really good at clearing organizational blockers for their engineers. The Goombas and Piranha Plants of bureaucracy, cross-team dependencies, access requests, and priority conflicts just melt away. Star power is temporary and expensive, but when you need to blast through a critical path, nothing else comes close.



The Mushroom Kingdom’s Levels are Environments


  


Now let’s talk about the world Mario operates in. Every level in the Mushroom Kingdom is an environment, and every environment is composed of the same building blocks. Some of these building blocks are tools that Mario can use:


  
    
      Level Element
      Environment Equivalent
    
  
  
    
      ? Blocks
      Unknown information sources — sometimes containing exactly what you need
    
    
      Pipes
      Entry points to sub-tasks, function calls, or deeper exploration
    
    
      Goombas
      Common obstacles, predictable errors, edge cases
    
    
      Pits
      Catastrophic failures, unrecoverable errors
    
  


Every level remixes these elements differently. World 1-1 is simple, a few Goombas, some bricks, a clear path to the flag. World 8-4 is a maze of pipes, hidden paths, and a boss fight with Bowser. Same building blocks, radically different difficulty.

Throughout each level, Mario interacts with the environment and the tools contained within. He enters pipes to warp from one place to another, the equivalent of an API call that transports you to an entirely different context. He bumps ? Blocks from below to discover power-ups, new agent skills materializing from the environment when you know where to look. He breaks bricks to clear paths or reveal hidden rewards, structured data yielding its value when you apply force in the right direction. He stomps on a Koopa shell and kicks it forward, turning an obstacle into a projectile that clears a line of Goombas, repurposing error outputs as inputs to solve downstream problems. The environment is more than just a backdrop. It’s also a toolbox.


  


But the Mushroom Kingdom isn’t one level. It’s organized into Worlds, each with a distinct theme and set of challenges. World 1 is grassland with basic enemies. World 3 is water. World 6 is ice. World 8 is Bowser’s Castle. Each world is a collection of levels that share a common environment type and difficulty profile.

This maps directly to how we build agentic AI systems for the real world. A single environment — say, a coding sandbox — is one level. But to build an agent that operates across a full domain, you need an entire world on the World Map: a collection of environments that together cover the breadth of that domain. A coding world includes environments for code generation, code review, root cause analysis, and operations. An office productivity world includes email, calendar, document editor, and spreadsheets. A research world includes literature search, data analysis, and report writing.


  
    
      World
      Theme
      Agent Domain
    
  
  
    
      World 1
      Grassland
      Simple text tasks, Q&amp;amp;A, summarization
    
    
      World 3
      Water
      Web browsing and API navigation
    
    
      World 6
      Ice
      Debugging in fragile or legacy environments
    
    
      World 8
      Bowser’s Castle
      Full autonomous task completion under adversarial conditions
    
  


Tasks and rewards

Having an environment is not enough. We need to define what we want to achieve from playing the game, and how we measure whether we achieved it. In RL terms, these are the tasks and the reward function.

Mario can play the same level with completely different objectives: complete the level, complete it as fast as possible, get the highest score, collect the most coins, accumulate the most 1-up lives, find all the hidden rewards, stomp on every last Goomba. Each objective produces a fundamentally different play style from the same environment. This is exactly the task definition problem in agentic AI. “Summarize this codebase” and “refactor this codebase” use the same files, the same tools, the same context, but they require entirely different strategies. The task is what transforms an environment from a sandbox into a mission.


  


At the end of every level, there’s a flagpole. Mario jumps on it, pulls down the flag, and receives a reward. The higher he grabs the flag, the bigger the reward. Some levels end with a boss fight against Bowser, where the reward is freeing a Toad (or eventually, Princess Peach). This is the reward signal — the feedback that tells the agent how well it performed the task.

But how do we actually measure how well the game was played? This is reward modeling, and it is where the machine learning engineering discipline really shines. The evaluation could be the raw score, the number of 1-ups gained, coins collected, Goombas stomped, time remaining, or different paths discovered. Most frequently, it is a combination of some or all of the above, weighted and balanced against each other. Do we reward Mario more for speed or for thoroughness? For survival or for aggression? For finding secrets or for staying on the critical path?


  
    
      Evaluation Metric
      Mario Measure
      Agent Measure
    
  
  
    
      Speed
      Time remaining on the clock
      Task completion latency
    
    
      Score
      Points accumulated
      Overall output quality
    
    
      Collection
      Coins gathered
      Information retrieved, resources used efficiently
    
    
      Completeness
      Hidden blocks found, secrets discovered
      Edge cases handled, comprehensive coverage
    
    
      Efficiency
      Enemies defeated per life
      Correct tool invocations per task
    
    
      Exploration
      Different paths taken
      Novel approaches discovered
    
  


Designing these rewards is a rigorous machine learning engineering discipline. A poorly shaped reward function produces an agent that technically completes tasks but in degenerate ways, like a Mario speedrunner who clips through walls. Impressive, but not what we actually wanted. Reward hacking is the Goodhart’s Law of agentic AI: when a measure becomes a target, it ceases to be a good measure.
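
As a hedged sketch of that weighting decision, a composite reward is just a weighted sum over normalized per-level measures; the stat names and weights below are made up for illustration.

def level_reward(stats, weights=None):
    """Combine several normalized measures, each assumed to be in [0, 1], into one scalar reward."""
    weights = weights or {"time_remaining": 0.2, "score": 0.4, "coins": 0.2, "secrets": 0.2}
    return sum(weights[key] * stats.get(key, 0.0) for key in weights)

Change the weights and you train a different Mario.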

Reinforcement Learning: Learning to Play

Here’s where the full picture comes together. Reinforcement learning is how Mario learns to complete the levels.

Mario starts each level knowing nothing about its specific layout. He has to:


  Observe the current state, what’s on screen, where the enemies are, what power-ups are available
  Decide on an action based on his policy, jump, run, shoot, or wait
  Act and receive feedback from the environment
  Update his policy based on the outcome


This is the MDP (Markov Decision Process) loop. The same loop described in Agents Are Workflows. The same loop that every agentic AI system runs:

\[v^\pi(s) = \mathop{\mathbb{E}}_{a \sim \pi}[r(s, a) + \gamma v^\pi(s^\prime)]\]

The value of Mario’s current state equals the expected immediate reward plus the discounted value of the next state. Should Mario jump NOW to get the coin, or wait and avoid the Goomba? The optimal policy $\pi^*$ balances immediate rewards against future outcomes.
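
In code, the loop is the same whether the agent is Mario or an LLM. This sketch assumes a generic env with reset() and step() methods and a policy callable, in the spirit of standard RL interfaces rather than any specific library:

def run_episode(env, policy, max_turns=100):
    """Observe, decide, act, receive feedback; the trajectory is what training learns from."""
    state = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action = policy(state)                    # decide: jump, run, shoot, or call a tool
        state, reward, done = env.step(action)    # act and receive feedback from the environment
        trajectory.append((state, action, reward))
        if done:                                  # flagpole reached, or a pit was less forgiving
            break
    return trajectory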

Through repeated play (training episodes), Mario learns:

  Which obstacles can be jumped on vs. avoided
  When to use power-ups vs. save them
  Which pipes lead to shortcuts vs. dead ends
  How to handle boss fights


An agentic AI model goes through the same process. Through reinforcement learning (PPO, DPO, GRPO, or whatever the latest acronym is), the model learns:

  Which tools to invoke for which subtasks
  When to think longer vs. act immediately
  Which approaches work for which problem types
  How to decompose complex tasks into manageable steps


And remember: Reinforcement learning does not make the agent harness smarter, nor the power-ups. It improves the model and the model only. Thus, the model is the product.

The Engineers behind the controller

So who’s actually making all of this work? Mario doesn’t train himself.

To teach agent Mario to be really good at the game, we employ a Machine Learning Engineer (MLE) — also called a research engineer or applied AI engineer at some organizations. The MLE is the game designer and coach rolled into one. They build the environments that Mario will train in: deciding which levels to include, what obstacles to place, what tools to make available, and how to sequence difficulty so Mario faces progressively harder challenges. They set up the harnesses and tools, define the tasks that Mario needs to achieve, and most importantly, design the reward function. The MLE decides what “good” looks like. Do we reward Mario for speed? Thoroughness? Both? How much? Environment design and reward design are the two highest leverage decisions in the entire pipeline. Get them right and Mario learns to play beautifully. Get them wrong and Mario learns to exploit glitches, or never encounters the challenges he needs to grow.

This isn’t hypothetical. Here’s a real job posting from Anthropic’s Universes team, whose entire job is building training environments for AI models:


  


“Environments where models learn to navigate ambiguity, handle interruptions, maintain context over extended interactions, and exercise judgment in open-ended scenarios.” That’s World 8 — and somebody has to build it.

Once the MLE has designed the training setup, the Machine Learning Systems Engineers (MLSys) take over. They are the ones who actually run the show at scale. They set up the environment and agent Mario across hundreds to hundreds of thousands of environments, tasks and iterations. They manage the compute, the distributed training runs, the data pipelines. They collect the reasoning traces, the sequences of observations, actions, and outcomes from every single episode Mario plays. And from these traces, they run the reinforcement learning algorithms that allow agent Mario to learn from experience.

This is the unsexy but critical part. An MLE can design the most elegant reward function in the world, but without MLSys engineers standing up the infrastructure to run millions of training episodes and collect the resulting data, Mario never gets past World 1-1.

Conclusion

Small Mario needs a mushroom to survive, power-ups to be effective, levels to practice on, and a reward at the flagpole to learn from. That’s the whole agentic AI stack: base model, harness, tools, environments, tasks, rewards, and reinforcement learning.

Now go save Princess Peach.

References

  Sutton, R. S., &amp;amp; Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press. http://www.incompleteideas.net/book/the-book-2nd.html
  Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., &amp;amp; Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv preprint arXiv:2302.04761. https://arxiv.org/abs/2302.04761
  OpenAI. (2024). OpenAI o1 System Card. https://cdn.openai.com/o1-system-card-20241205.pdf
  Lee, H. (2025). Agents Are Workflows. Han, Not Solo. https://leehanchung.github.io/blogs/2025/05/09/agent-is-workflow/
  Lee, H. (2025). No Code, Low Code, Real Code. Han, Not Solo. https://leehanchung.github.io/blogs/2025/06/26/no-code-low-code-full-code/


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {It&apos;s-a Me, Agentic AI},
    year = {2026},
    month = {02},
    day = {18},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2026/02/18/mario-agentic-ai/}
}



        </description>

        <pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2026/02/18/mario-agentic-ai/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2026/02/18/mario-agentic-ai/</guid>

      </item>

    

      <item>

        <title> The Evaluation Design Lifecycle: From Business Need to Valid Metrics</title>

        <description>

           - 

          When teams deploy LLMs that fail in production, the root cause is rarely the metrics they chose—it’s that they skipped the process of determining which metrics matter in the first place. You can have ROUGE scores, BERTScore, and even sophisticated LLM-as-a-judge evaluations, yet still build the wrong thing if you haven’t connected measurement to actual business requirements.

This blog post introduces the evaluation design lifecycle: a systematic process for translating stakeholder needs into valid, actionable metrics. We will not be discussing specific evaluation metrics. Instead, we will establish the foundational process that makes those metrics meaningful: the meta of evaluations.

The Missing Layer in Evaluation

Most AI evaluation discussions jump straight to frameworks, metrics, or tools: “Should we use RAGAS or DeepEval?”, “Should we use LangSmith or Braintrust?” But this skips a critical question: what are you trying to evaluate, and why are you evaluating at all?

The evaluation design lifecycle fills this gap. It’s the process that helps you design and select any specific metric, ensuring that when you do measure, you’re measuring what actually matters. Whether you’re comparing competing systems, validating a single candidate, or assessing a component within a compound AI system, the fundamental question remains: does this system meet stakeholder needs?

Notice the word “stakeholder,” not “end user.” The person who needs the evaluation—the customer of your evaluation—is rarely the person who will use the system daily. A hospital administrator evaluating a medical chatbot has different concerns than the emergency room nurses who will use it. This misalignment between evaluation customer and end user creates complexity that a disciplined lifecycle must navigate.

The Seven Phases of Evaluation Design

The evaluation design lifecycle consists of seven phases that progress from high-level purpose to concrete measurement. Each phase builds on the previous, creating a traceable path from business need to validated metric. Understanding this lifecycle is essential because it reveals where evaluation efforts typically fail: not in the metrics themselves, but in the foundational decisions that determine which metrics are appropriate.

The lifecycle applies regardless of which specific metrics you ultimately choose. Whether you’ll use statistical metrics like ROUGE, semantic metrics like BERTScore, or LLM-as-a-judge approaches, you must first complete this design process. Think of the lifecycle as the scaffolding that ensures your chosen metrics actually measure what matters.

Here are the seven phases:

Phase 1: Clarify Evaluation Purpose and Scope

The lifecycle begins with purpose. What decision will this evaluation inform? What exactly are you evaluating? The full compound AI system? An agent? An MCP server? An MCP tool? A workflow? A component of the workflow? Or something in between?

Consider a machine translation component embedded in a multilingual information retrieval system. Are you evaluating just the translation engine? The entire retrieval system? The system within the specific context of hospital administrators searching medical records? Each scope demands different evaluation criteria and, ultimately, different metrics.

Define your boundaries explicitly. Does “the system” include the user interface? The training required to use it effectively? The humans who will interpret its outputs? These questions seem obvious, yet teams routinely stumble by leaving them unanswered.

Phase 2: Build a Task Model

With scope defined, the next phase identifies who will use your system and what they’re trying to accomplish. AI systems don’t exist in a vacuum—they serve specific purposes for specific people under specific constraints.

Returning to our information retrieval example: Will trained librarians use the system? Students conducting research? Emergency room staff under time pressure? Each user type brings different needs, skills, and tolerances for error. Your evaluation must account for these differences, as they directly influence which quality characteristics matter most.

Phase 3: Identify Quality Characteristics

With a clear task model, you can now identify which system attributes matter for your use case. Start with a framework of quality characteristics: functionality, reliability, efficiency, portability, and similar attributes that define system quality.

Treat these as a checklist, not a mandate. Not every characteristic carries equal weight. In a time-critical environment like an operating room, response time might trump everything else. In a legal context, reliability and auditability could be paramount. Your task model from Phase 2 should inform which characteristics matter most.

This phase produces a prioritized list of quality characteristics. The next phase decomposes these into measurable requirements.

Phase 4: Decompose into Measurable Requirements

This is where evaluation design becomes concrete. You can’t just say “the system should be accurate”—you need to define what accuracy means in your context and how you’ll measure it. Phase 4 transforms high-level quality characteristics into specific, measurable attributes.

This decomposition often requires building a hierarchy of attributes and sub-attributes. “Translation quality,” for instance, has historically fragmented into accuracy, fluency, intelligibility, fidelity, and information preservation—each an attempt to find something objectively measurable.

The key insight: rarely does a single attribute determine system success. You’re almost always balancing multiple requirements, which means you need multiple measurements. This multiplicity is why the evaluation design lifecycle matters—without systematic decomposition, teams pick metrics that measure something, but not necessarily what matters.

Phase 4 outputs a list of specific, measurable requirements. Each requirement must be concrete enough that Phase 5 can define a valid metric for it.
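
As a hedged illustration, the Phase 4 output for the multilingual retrieval example might be captured as a small requirements list like the one below. The characteristics, requirements, and wording are invented for illustration; what matters is the traceable decomposition from quality characteristic to measurable attribute.

# Hypothetical Phase 4 output for a multilingual retrieval system.
# Each quality characteristic decomposes into one or more measurable requirements.
requirements = [
    {
        &quot;characteristic&quot;: &quot;functionality&quot;,
        &quot;requirement&quot;: &quot;translation adequacy&quot;,
        &quot;measurable_as&quot;: &quot;fraction of source facts preserved in the output&quot;,
    },
    {
        &quot;characteristic&quot;: &quot;efficiency&quot;,
        &quot;requirement&quot;: &quot;query latency&quot;,
        &quot;measurable_as&quot;: &quot;p95 end-to-end response time in seconds&quot;,
    },
    {
        &quot;characteristic&quot;: &quot;reliability&quot;,
        &quot;requirement&quot;: &quot;retrieval robustness&quot;,
        &quot;measurable_as&quot;: &quot;recall@10 on a held-out query set&quot;,
    },
]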

Phase 5: Define Valid Metrics and Methods

For each requirement from Phase 4, you must now define both what you’ll measure and how you’ll measure it. This is where specific metrics—BLEU, ROUGE, BERTScore, human evaluation, LLM-as-a-judge—enter the picture. The evaluation design lifecycle ensures you select these metrics deliberately, not arbitrarily.

This phase is where many evaluations fail, not because teams lack metrics, but because they lack valid metrics. Consider the cautionary tale of ALPAC’s intelligibility metric from early machine translation evaluation. Evaluators asked humans to rate translations using scales with descriptions like “perfectly clear and intelligible” or “hopelessly unintelligible.” The problem? Without agreement on what these terms mean, the metric couldn’t be valid. Circular definitions don’t produce reliable measurements.

Metric validity requires two components:

  The measure itself: What specific value or score will you compute?
  The measurement method: What procedure will produce that measure reliably?


For reference-based metrics like ROUGE (Chapter 2) or BERTScore (Chapter 3), the measure is a numerical score and the method involves comparing system output to reference texts. For LLM-as-a-judge approaches (Chapter 7), the measure might be a categorical rating and the method involves prompt design and model selection.

Once you have valid metrics, establish threshold scores. What constitutes success? What’s acceptable? What fails? These cut-offs flow directly from your task model (Phase 2) and business requirements (Phase 1).

Note that not all metrics require complex evaluation protocols. If your budget caps at $500 and the cheapest system costs $20,000, you can skip subsequent phases entirely. Price is a perfectly valid—and easily measured—attribute. The lifecycle doesn’t demand complexity; it demands appropriateness.
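
To make the measure/method/threshold split concrete, here is a minimal sketch in Python. The requirement, the stubbed scorer, and the 0.45 cut-off are assumptions chosen for illustration, not recommendations.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    name: str
    method: Callable[[str, str], float]   # (system_output, reference) -&amp;gt; score
    threshold: float                      # success cut-off derived from the task model

def rouge_l_stub(output: str, reference: str) -&amp;gt; float:
    # Placeholder score; in practice, call an existing ROUGE implementation here.
    return 0.0

adequacy = Metric(name=&quot;rouge_l&quot;, method=rouge_l_stub, threshold=0.45)

def passes(metric: Metric, output: str, reference: str) -&amp;gt; bool:
    return metric.method(output, reference) &amp;gt;= metric.threshold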

Phase 6: Design Evaluation Execution

With metrics defined, Phase 6 plans the actual evaluation logistics. Who will conduct measurements? When and where? What test materials do you need? How will you present results to support decision-making?

This phase transforms your evaluation design from concept to actionable plan. It’s also your final opportunity to catch design flaws before investing time and resources in execution. Questions to address include:

  What test data or scenarios will you use?
  How many samples do you need for statistical significance?
  Who performs the evaluation (automated systems, human raters, domain experts)?
  What format will results take (quantitative scores, qualitative reports, comparative rankings)?


The outputs from Phase 6 are an execution plan and any necessary test materials.

Phase 7: Execute and Report

The final phase executes the evaluation plan. Collect measurements, compare results against predetermined thresholds from Phase 5, and synthesize findings into a clear report that supports the decision identified in Phase 1.

This phase is what most people think of as “the evaluation.” But as the lifecycle reveals, execution is the culmination of six prior phases of deliberate design. Skip that foundation, and your measurements—however precise—may answer the wrong questions entirely.

The output of Phase 7 is an evaluation report that traces from measurements back through the lifecycle: these metrics were chosen because of these requirements, which decomposed from these quality characteristics, which mattered because of this task model, which served this business purpose. This traceability is what separates rigorous evaluation from measurement theater.

The Lifecycle as Living Process

A final reality: evaluation requirements aren’t static. As you execute your evaluation, you may discover that no available system meets all requirements, or that a system offers capabilities you hadn’t considered. Requirements evolve as understanding deepens and circumstances change.

This doesn’t diminish the lifecycle’s value—it makes it essential. The seven phases provide a structured foundation for principled adaptation. When requirements shift, you can trace implications systematically rather than making ad-hoc adjustments. Should a new requirement emerge in Phase 6, you can walk back through Phases 4 and 5 to ensure your metrics still align. The lifecycle creates traceability even as conditions evolve.

From Process to Metrics

The evaluation design lifecycle establishes what to measure before addressing how to measure it. This ordering is deliberate. Without Phase 1 through Phase 4, even the most sophisticated metric measures something arbitrary. With them, metrics become instruments of purpose rather than exercises in measurement.

The subsequent chapters of this book dive deep into specific evaluation methods:

  Classical reference-based metrics (BLEU, ROUGE) that compare outputs to gold standards
  Semantic similarity metrics (BERTScore, COMET) that capture meaning beyond surface form
  Human evaluation protocols that ground metrics in actual user judgments
  LLM-as-a-judge approaches that scale evaluation using language models themselves
  Alignment techniques (RLHF, Constitutional AI) that connect evaluation to system improvement


Each method has strengths and limitations. Each makes different assumptions about what “quality” means. The evaluation design lifecycle ensures you select methods that align with your specific business needs rather than defaulting to whatever seems most sophisticated or most commonly used.

When you encounter ROUGE-L in Chapter 2, you’ll understand not just the formula for computing n-gram overlap, but why you might choose ROUGE-L over other metrics based on your task model and quality requirements. When you learn about BERTScore in Chapter 3, you’ll recognize when semantic similarity matters more than lexical overlap. When you implement LLM-as-a-judge in Chapter 7, you’ll know which quality characteristics it measures well and which it doesn’t.

The lifecycle transforms metrics from black boxes into deliberate choices. That transformation is what makes evaluation meaningful rather than merely measurable.

@article{
    leehanchung_evaluation_design_lifecycle,
    author = {Lee, Hanchung},
    title = {The Evaluation Design Lifecycle: From Business Need to Valid Metrics},
    year = {2025},
    month = {11},
    day = {21},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/11/21/evaluation-framework/}
}



        </description>

        <pubDate>Fri, 21 Nov 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/11/21/evaluation-framework/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/11/21/evaluation-framework/</guid>

      </item>

    

      <item>

        <title>Databricks&apos; Strategic Playbook: Reynold Xin on Growth, AI, and the Future of Data Infrastructure</title>

        <description>

          Apache Spark&apos;s #1 committer reveals how contrarian decisions and AI-first strategy drive Databricks&apos; 60% YoY growth - 

          Reynold Xin, Apache Spark’s #1 committer famous for “deleting more code than others wrote,” reveals how Databricks maintains 60% YoY growth while competitors struggle. In a candid interview at Hysta Rising, he shares the contrarian strategies, technical decisions, and AI-first approach shaping the future of data infrastructure.



The Growth Story: Databricks vs. Snowflake

Databricks has maintained impressive growth, recently hitting 60% year-over-year and currently growing over 50% YoY. Internal growth rates are even higher, though undisclosed. This stands in stark contrast to Snowflake’s current 20-30% YoY growth at similar revenue levels.

Xin provided crucial context: just 2-3 years ago, Snowflake was growing 100% YoY and was considered the fastest-growing public company in history for enterprise go-to-market. However, their decline illustrates a critical strategic lesson.

The GTM Investment Trap

When Wall Street shifted focus from growth to profitability post-ZIRP era, many companies responded by pausing go-to-market (GTM) hiring. This creates a dangerous illusion: immediate profitability improvement masks a growth time bomb.

Why? Account executives and solution architects typically take 1-2 years to become productive. “Pausing does not have any impact on growth for the next year or two. So momentum will continue for a year and then collapse,” Xin explained.

Databricks took the contrarian approach—doubling down on GTM investments while competitors pulled back. This strategic patience is now paying dividends as competitors’ growth rates plummet.

The AI Acceleration

Most enterprises remain primitive in AI/ML/data science adoption, which traditionally generated much smaller revenue than data warehousing. However, 2023 marked a turning point, with growth rates accelerating partly due to generative AI adoption. Databricks now generates over $1 billion ARR from AI products alone.

M&amp;amp;A Strategy: Acquiring DNA, Not Revenue

Databricks’ acquisition strategy differs fundamentally from traditional enterprise approaches:


  Focus on DNA over revenue: “The thesis is never about getting revenue, but getting DNA. Revenue is validation.”
  Target founders with startup DNA: Seek founders who’ve gone through the “5-10 year grind” with hands-on customer experience
  Empower acquired teams: Give them resources to drive new product growth
  Contrast with traditional M&amp;amp;A: Unlike Salesforce or Cisco, which primarily acquire for revenue


The OpenAI Partnership

OpenAI is a significant Databricks customer, and the partnership includes:

  Access to specific models with guaranteed capacity
  $100M capacity deal for on-demand usage
  Strategic decision to focus on high-margin software rather than competing in model training
  Recognition that model serving has “horrible margins” compared to software’s 80-90% margins


Pivotal Moments in Databricks’ Evolution

2015: The PLG Pivot
Started and ended the year with $1M ARR after attempting product-led growth (PLG). The key learning: GTM motion must match the product. Databricks requires VPC peering and production database connections—sensitive operations. This means potential customers can’t simply swipe on a credit card to obtain the service.

2017: Microsoft Azure Partnership
This partnership became a growth catalyst, with Microsoft and Databricks both selling Azure Databricks. At one point, half of growth came from this channel, allowing more efficient sales team scaling.

2020: Multi-Product Expansion
Transitioning from single to multiple products marked a fundamental shift. As Xin noted, “Most companies in Silicon Valley never accomplished second product success.” This multi-year journey included rapid adaptations for generative AI.

Leadership Evolution: From Coder to Executive

Xin’s personal journey reflects a common founder transition:

  First 7 years: “Writing lots of code and building”
  Became a manager reluctantly when “no one wanted to manage that company”
  Built the data warehousing business and took over engineering
  Transitioned from a hands-on IC to a “useless manager” over the past 5 years


Key leadership lessons:

  Delegation mistakes: “Delegated too much was one major mistake”
  Imposter syndrome: Initially deferring too much to hired executives
  Context matters: Realizing that external hires often lack crucial context
  Founder therapy groups: The value of peer support when hiring executives


The Future: AI-Native Databases

Xin sees a massive disruption coming to the $100B OLTP market still dominated by Oracle. The key insight: AI won’t just optimize existing databases—it will fundamentally reimagine how we build and operate data systems.

“Future databases will be provisioned and maintained primarily by AI,” Xin predicts. This isn’t incremental improvement but architectural revolution:

  Self-optimizing schemas: AI dynamically adjusting data models based on query patterns
  Autonomous provisioning: Infrastructure that scales predictively, not reactively
  Intelligent indexing: AI determining optimal indexes in real-time
  Cost collapse: Building and maintaining custom applications becomes 10-100x cheaper


His provocative prediction challenges the entire enterprise software model: “Now there’s no reason for people to buy Workday when you can build bespoke solutions based on company workloads.” When AI can generate and maintain custom applications at marginal cost, why pay for generic SaaS?

Industry Consolidation

The data infrastructure world is consolidating to five major players:

  Three cloud service providers (each with their own offerings)
  Databricks
  Snowflake


“None of them will go away. Smaller players will become irrelevant,” Xin predicts, pointing to the Fivetran-dbt merger as evidence of this trend.



Key Takeaways for AI Engineers

The Databricks story offers crucial lessons for technical leaders navigating the AI transformation:


  
    Margin discipline matters: Xin’s rejection of low-margin model serving in favor of 80-90% margin software shows the importance of business model clarity, even in AI hype cycles.
  
  
    Context beats credentials: Founders who’ve “done the grind” often outperform prestigious hires lacking domain context — a lesson for both hiring and career planning.
  
  
    Timing contrarian bets: While competitors optimize for quarterly earnings, Databricks’ multi-year GTM investment demonstrates how patient capital wins in enterprise markets.
  
  
    AI changes everything: The shift from human-managed to AI-managed infrastructure is a complete reimagining of the $100B+ database market.
  


As Xin’s journey from “writing lots of code” to “useless manager” shows, the path to transforming industries requires both technical depth and strategic courage. In the AI era, those who understand both code and markets will shape the future of enterprise software.



@article{
    leehanchung_databricks_reynold_xin,
    author = {Lee, Hanchung},
    title = {Databricks&apos; Strategic Playbook: Reynold Xin on Growth, AI, and the Future of Data Infrastructure},
    year = {2025},
    month = {11},
    day = {06},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/11/06/raynold-xin-databricks/}
}



        </description>

        <pubDate>Thu, 06 Nov 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/11/06/raynold-xin-databricks/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/11/06/raynold-xin-databricks/</guid>

      </item>

    

      <item>

        <title>Claude Agent Skills: A First Principles Deep Dive</title>

        <description>

          Deconstructing prompt-based meta-tool architecture and context injection patterns for AI engineering - 

Claude’s Agent Skills system represents a sophisticated prompt-based meta-tool architecture that extends LLM capabilities through specialized instruction injection. Unlike traditional function calling or code execution, skills operate through prompt expansion and context modification, changing how Claude processes subsequent requests without writing executable code.

This deep dive deconstructs Claude’s Agent Skills system from first principles and documents the architecture, where a tool named “Skill” acts as a meta-tool for injecting domain-specific prompts into the conversation context. We’ll walk through the complete lifecycle using the skill-creator and internal-comms skills as case studies, examining everything from file parsing to API request structure to Claude’s decision-making process.

Claude Agent Skills Overview

Claude uses Skills to improve how it performs specific tasks. Skills are defined as folders that include instructions, scripts, and resources that Claude can load when needed. Claude uses a declarative, prompt-based system for skill discovery and invocation. The AI model (Claude) makes the decision to invoke skills based on textual descriptions presented in its system prompt. There is no algorithmic skill selection or AI-powered intent detection at the code level. The decision-making happens entirely within Claude’s reasoning process based on the skill descriptions provided.

Skills are not executable code. They do NOT run Python or JavaScript, and there’s no HTTP server or function calling happening behind the scenes. They are also not hardcoded into Claude’s system prompt. Skills live in a separate part of the API request structure.

So what are they? Skills are specialized prompt templates that inject domain-specific instructions into the conversation context. When a skill is invoked, it modifies both the conversation context (by injecting instruction prompts) and the execution context (by changing tool permissions and potentially switching the model). Instead of executing actions directly, skills expand into detailed prompts that prepare Claude to solve a specific type of problem. Each skill appears as a dynamic addition to the tool schema that Claude sees.

When users send a request, Claude receives three things: the user message, the available tools (Read, Write, Bash, etc.), and the Skill tool. The Skill tool’s description contains a formatted list of every available skill with its name, description, and other fields combined. Claude reads this list and uses its native language understanding to match your intent against the skill descriptions. If you say “help me draft an internal announcement,” Claude sees the internal-comms skill’s description (“When user wants to write internal communications using format that his company likes to use”), recognizes the match, and invokes the Skill tool with command: &quot;internal-comms&quot;.


  Terminology Note:
  
    Skill tool (capital S) = The meta-tool that manages all skills. It appears in Claude’s tools array alongside Read, Write, Bash, etc.
    skills (lowercase s) = Individual skills like pdf, skill-creator, internal-comms. These are the specialized instruction templates that the Skill tool loads.
  


Here’s a more visual representation of how skills are used by Claude.



The skill selection mechanism has no algorithmic routing or intent classification at the code level. Claude Code doesn’t use embeddings, classifiers, or pattern matching to decide which skill to invoke. Instead, the system formats all available skills into a text description embedded in the Skill tool’s prompt, and lets Claude’s language model make the decision. This is pure LLM reasoning. No regex, no keyword matching, no ML-based intent detection. The decision happens inside Claude’s forward pass through the transformer, not in the application code.

When Claude invokes a skill, the system follows a simple workflow: it loads a markdown file (SKILL.md), expands it into detailed instructions, injects those instructions as new user messages into the conversation context, modifies the execution context (allowed tools, model selection), and continues the conversation with this enriched environment. This is fundamentally different from traditional tools, which execute and return results. Skills prepare Claude to solve a problem, rather than solving it directly.
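
In pseudocode, the workflow looks roughly like the sketch below. This is an illustrative Python rendering of the steps just described, not Claude Code’s actual implementation (which is JavaScript and examined later); helpers like load_skill and expand are invented names.

# Illustrative sketch of the skill invocation workflow described above.
# Helper functions and field names are hypothetical.
def invoke_skill(skill_name, conversation, execution_context):
    skill = load_skill(skill_name)              # read and parse SKILL.md
    prompt = expand(skill.markdown_body)        # resolve {baseDir}, assemble instructions

    # Inject the expanded instructions into the conversation as a new user message.
    conversation.append({&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: prompt})

    # Modify the execution context for the duration of the skill.
    execution_context.allowed_tools = skill.frontmatter.get(&quot;allowed-tools&quot;)
    execution_context.model = skill.frontmatter.get(&quot;model&quot;, &quot;inherit&quot;)

    # Claude then continues the conversation inside this enriched environment.
    return conversation, execution_context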

The following table helps disambiguate the differences between Tools and Skills and their capabilities:


  
    
      Aspect
      Traditional Tools
      Skills
    
  
  
    
      Execution Model
      Synchronous, direct
      Prompt expansion
    
    
      Purpose
      Perform specific operations
      Guide complex workflows
    
    
      Return Value
      Immediate results
      Conversation context + execution context changes
    
    
      Example
      Read, Write, Bash
      internal-comms, skill-creator
    
    
      Concurrency
      Generally safe
      Not concurrency-safe
    
    
      Type
      Various
      Always &quot;prompt&quot;
    
  


Building Agent Skills

Now let’s dive into how to build Skills by examining the skill-creator Skill from Anthropic’s skill repository as a case study. As a reminder, agent skills are organized folders of instructions, scripts, and resources that agents can discover and load dynamically to perform better at specific tasks. Skills extend Claude’s capabilities by packaging your expertise into composable resources for Claude, transforming general-purpose agents into specialized agents that fit your needs.


  Key Insight: Skill = Prompt Template + Conversation Context Injection + Execution Context Modification + Optional data files and Python Scripts


Every Skill is defined in a markdown file named SKILL.md (case-insensitive) with optional bundled files stored under /scripts, /references, and /assets. These bundled files can be Python scripts, shell scripts, font definitions, templates, etc. Using skill-creator as an example, it contains SKILL.md, LICENSE.txt for the license, and a few Python scripts under the /scripts folder. skill-creator does not have any /references or /assets.



Skills are discovered and loaded from multiple sources. Claude Code scans user settings (~/.config/claude/skills/), project settings (.claude/skills/), plugin-provided skills, and built-in skills to build the available skills list. For Claude Desktop, we can upload a custom skill as follows.




  NOTE: The most important concept for building Skills is Progressive Disclosure - showing just enough information to help agents decide what to do next, then revealing more details as they need them. In the case of agent skills, this means:
  
    Disclose Frontmatter: minimal (name, description, license)
    If a skill is chosen, load SKILL.md: comprehensive but focused
    And then load helper assets, references, and scripts as the skill is being executed
  


Writing SKILL.md

SKILL.md is the core of a skill’s prompt. It is a markdown file that follows a two-part structure: frontmatter and content. The frontmatter configures HOW the skill runs (permissions, model, metadata), while the markdown content tells Claude WHAT to do. The frontmatter is the header of the markdown file, written in YAML.

┌─────────────────────────────────────┐
│ 1. YAML Frontmatter (Metadata)      │ ← Configuration
│    ---                              │
│    name: skill-name                 │
│    description: Brief overview      │
│    allowed-tools: &quot;Bash, Read&quot;      │
│    version: 1.0.0                   │
│    ---                              │
├─────────────────────────────────────┤
│ 2. Markdown Content (Instructions)  │ ← Prompt for Claude
│                                     │
│    Purpose explanation              │
│    Detailed instructions            │
│    Examples and guidelines          │
│    Step-by-step procedures          │
└─────────────────────────────────────┘


Frontmatter

The frontmatter contains metadata that controls how Claude discovers and uses the skill. As an example, here’s the frontmatter from skill-creator:

---
name: skill-creator
description: Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude&apos;s capabilities with specialized knowledge, workflows, or tool integrations.
license: Complete terms in LICENSE.txt
---

Let’s walk through the frontmatter fields one by one.



name (Required)

Self-explanatory: the name of the skill.


  The name of a skill is used as a command in Skill Tool.


description (Required)

The description field provides a brief summary of what the skill does. This is the primary signal Claude uses to determine when to invoke a skill. In the example above, the description explicitly states “This skill should be used when users want to create a new skill” — this type of clear, action-oriented language helps Claude match user intent to skill capabilities.

The system automatically appends source information to the description (e.g., &quot;(plugin:skills)&quot;), which helps distinguish between skills from different sources when multiple skills are loaded.

when_to_use (Undocumented—Likely Deprecated or Future Feature)


  ⚠️ Important Note: The when_to_use field appears extensively in the codebase but is not documented in any official Anthropic documentation. This field may be:
  
    A deprecated feature being phased out
    An internal/experimental feature not yet officially supported
    A planned feature that hasn’t been released
  

  Recommendation: Rely on a detailed description field instead. Avoid using when_to_use in production skills until it appears in official documentation.


Despite being undocumented, here’s how when_to_use currently works in the codebase:

function formatSkill(skill) {
  let description = skill.whenToUse
    ? `${skill.description} - ${skill.whenToUse}`
    : skill.description;

  return `&quot;${skill.name}&quot;: ${description}`;
}


When present, when_to_use gets appended to the description with a hyphen separator. For example:
&quot;skill-creator&quot;: Create well-structured, reusable skills... - When user wants to build a custom skill package with scripts, references, or assets


This combined string is what Claude sees in the Skill tool’s prompt. However, since this behavior is undocumented, it could change or be removed in future releases. The safer approach is to include usage guidance directly in the description field, as shown in the skill-creator example above.

license (Optional)

Self-explanatory.

allowed-tools (Optional)

The allowed-tools field defines which tools the skill can use without user approval, similar to Claude’s allowed-tools.

This is a comma-separated string that gets parsed into an array of allowed tool names. You can use wildcards to scope permissions, e.g., Bash(git:*) allows only git subcommands, while Bash(npm:*) permits all npm operations. The skill-creator skill uses &quot;Read,Write,Bash,Glob,Grep,Edit&quot; to give it broad file and search capabilities. A common mistake is listing every available tool, which needlessly expands the attack surface and defeats the permission model.


  Only include what your skill actually needs—if you’re just reading and writing files, &quot;Read,Write&quot; is sufficient.


# ✅ skill-creator allows multiple tools
allowed-tools: &quot;Read,Write,Bash,Glob,Grep,Edit&quot;

# ✅ Specific git commands only
allowed-tools: &quot;Bash(git status:*),Bash(git diff:*),Bash(git log:*),Read,Grep&quot;

# ✅ File operations only
allowed-tools: &quot;Read,Write,Edit,Glob,Grep&quot;

# ❌ Unnecessary surface area
allowed-tools: &quot;Bash,Read,Write,Edit,Glob,Grep,WebSearch,Task,Agent&quot;

# ❌ Unnecessary surface area with all npm commands
allowed-tools: &quot;Bash(npm:*),Read,Write&quot;
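
For intuition, here is a minimal, hypothetical sketch of how a comma-separated allowed-tools string with the wildcard syntax shown above might be parsed and checked. This is illustrative Python, not Claude Code’s actual permission logic.

# Illustrative parser/matcher for allowed-tools strings; not the real implementation.
def parse_allowed_tools(spec: str) -&amp;gt; list[str]:
    # &quot;Bash(git:*),Read,Grep&quot; -&amp;gt; [&quot;Bash(git:*)&quot;, &quot;Read&quot;, &quot;Grep&quot;]
    return [entry.strip() for entry in spec.split(&quot;,&quot;) if entry.strip()]

def is_allowed(tool_call: str, allowed: list[str]) -&amp;gt; bool:
    # tool_call examples: &quot;Read&quot;, &quot;Bash(git status)&quot;, &quot;Bash(npm install)&quot;
    for pattern in allowed:
        if pattern == tool_call:
            return True
        # Treat &quot;Bash(git:*)&quot; as a prefix rule on the command inside Bash(...).
        if pattern.endswith(&quot;:*)&quot;) and tool_call.startswith(pattern[:-3]):
            return True
    return False

allowed = parse_allowed_tools(&quot;Bash(git:*),Read,Grep&quot;)
assert is_allowed(&quot;Bash(git status)&quot;, allowed)      # git subcommand permitted
assert not is_allowed(&quot;Bash(rm -rf /)&quot;, allowed)    # anything else is blocked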


model (Optional)

The model field defines which model the skill can use. It defaults to inheriting the current model in the user session. For complex tasks like code review, skills can request more capable models such as Claude Opus or other OSS Chinese models. IYKYK.

model: &quot;claude-opus-4-20250514&quot;  # Use specific model
model: &quot;inherit&quot;                 # Use session&apos;s current model (default)


version, disable-model-invocation, and mode (Optional)

Skills support three optional frontmatter fields for versioning and invocation control. The version field (e.g., version: “1.0.0”) is a metadata field for tracking skill versions, parsed from the frontmatter but primarily used for documentation and skill management purposes.

The disable-model-invocation field (boolean) prevents Claude from automatically invoking the skill via the Skill tool. When set to true, the skill is excluded from the skills list shown to Claude and can only be invoked manually by users via `/skill-name`, making it ideal for dangerous operations, configuration commands, or interactive workflows that require explicit user control.

The mode field (boolean) categorizes a skill as a “mode command” that modifies Claude’s behavior or context. When set to true, the skill appears in a special “Mode Commands” section at the top of the skills list (separate from regular utility skills), making it prominent for skills like debug-mode, expert-mode, or review-mode that establish specific operational contexts or workflows.

SKILL.md Prompt Content

After the frontmatter comes the markdown content - the actual prompt that Claude receives when the skill is invoked. This is where you define the skill’s behavior, instructions, and workflows. The key to writing effective skill prompts is keeping them focused and using progressive disclosure: provide core instructions in SKILL.md, and reference external files for detailed content.

Here’s a recommended content structure:

---
# Frontmatter here
---

# [Brief Purpose Statement - 1-2 sentences]

## Overview
[What this skill does, when to use it, what it provides]

## Prerequisites
[Required tools, files, or context]

## Instructions

### Step 1: [First Action]
[Imperative instructions]
[Examples if needed]

### Step 2: [Next Action]
[Imperative instructions]

### Step 3: [Final Action]
[Imperative instructions]

## Output Format
[How to structure results]

## Error Handling
[What to do when things fail]

## Examples
[Concrete usage examples]

## Resources
[Reference scripts/, references/, assets/ if bundled]


As an example, the skill-creator skill contains the following instructions, which specify each step of the workflow required to create skills.

## Skill Creation Process

### Step 1: Understanding the Skill with Concrete Examples
### Step 2: Planning the Reusable Skill Contents
### Step 3: Initializing the Skill
### Step 4: Edit the Skill
### Step 5: Packaging a Skill


When Claude invokes this skill, it receives the entire prompt as new instructions with the base directory path prepended. The {baseDir} variable resolves to the skill’s installation directory, allowing Claude to load reference files using the Read tool: Read({baseDir}/scripts/init_skill.py). This pattern keeps the main prompt concise while making detailed documentation available on demand.

Best practices for prompt content:

  Keep under 5,000 words (~800 lines) to avoid overwhelming context
  Use imperative language (“Analyze code for…”) not second person (“You should analyze…”)
  Reference external files for detailed content rather than embedding everything
  Use {baseDir} for paths, never hardcode absolute paths like /home/user/project/


❌ Read /home/user/project/config.json
✅ Read {baseDir}/config.json


When the skill is invoked, Claude receives access only to the tools specified in allowed-tools, and the model may be overridden if specified in the frontmatter. The skill’s base directory path is automatically provided, making bundled resources accessible.

Bundling Resources with Your Skill

Skills become powerful when you bundle supporting resources alongside SKILL.md. The standard structure uses three directories, each serving a specific purpose:

my-skill/
├── SKILL.md              # Core prompt and instructions
├── scripts/              # Executable Python/Bash scripts
├── references/           # Documentation loaded into context
└── assets/               # Templates and binary files


Why bundle resources? Keeping SKILL.md concise (under 5,000 words) prevents overwhelming Claude’s context window. Bundled resources let you provide detailed documentation, automation scripts, and templates without bloating the main prompt. Claude loads them only when needed using progressive disclosure.

The scripts/ Directory

The scripts/ directory contains executable code that Claude runs via the Bash tool—automation scripts, data processors, validators, or code generators that perform deterministic operations.

As an example, skill-creator’s SKILL.md references scripts like this:
When creating a new skill from scratch, always run the `init_skill.py` script. The script conveniently generates a new template skill directory that automatically includes everything a skill requires, making the skill creation process much more efficient and reliable.

Usage:

```scripts/init_skill.py &amp;lt;skill-name&amp;gt; --path &amp;lt;output-directory&amp;gt;```

The script:
  - Creates the skill directory at the specified path
  - Generates a SKILL.md template with proper frontmatter and TODO placeholders
  - Creates example resource directories: scripts/, references/, and assets/
  - Adds example files in each directory that can be customized or deleted


When Claude sees this instruction, it executes python {baseDir}/scripts/init_skill.py. The {baseDir} variable automatically resolves to the skill’s installation path, making the skill portable across different environments.

Use scripts/ for complex multi-step operations, data transformations, API interactions, or any task requiring precise logic better expressed in code than natural language.

The references/ Directory

The references/ directory stores documentation that Claude reads into its context when referenced. This is text content—markdown files, JSON schemas, configuration templates, or any documentation Claude needs to complete the task.

As an example, mcp-creator’s SKILL.md points to its reference files like this:
#### 1.4 Study Framework Documentation

**Load and read the following reference files:**

- **MCP Best Practices**: [📋 View Best Practices](./reference/mcp_best_practices.md) - Core guidelines for all MCP servers

**For Python implementations, also load:**
- **Python SDK Documentation**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
- [🐍 Python Implementation Guide](./reference/python_mcp_server.md) - Python-specific best practices and examples

**For Node/TypeScript implementations, also load:**
- **TypeScript SDK Documentation**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`
- [⚡ TypeScript Implementation Guide](./reference/node_mcp_server.md) - Node/TypeScript-specific best practices and examples


When Claude encounters these instructions, it uses the Read tool: Read({baseDir}/references/mcp_best_practices.md). The content gets loaded into Claude’s context, providing detailed information without cluttering SKILL.md.

Use references/ for detailed documentation, large pattern libraries, checklists, API schemas, or any text content that’s too verbose for SKILL.md but necessary for the task.

The assets/ Directory

The assets/ directory contains templates and binary files that Claude references by path but doesn’t load into context. Think of this as the skill’s static resources - HTML templates, CSS files, images, configuration boilerplate, or fonts.

In SKILL.md:
Use the template at {baseDir}/assets/report-template.html as the report structure.
Reference the architecture diagram at {baseDir}/assets/diagram.png.


Claude sees the file path but doesn’t read the content. Instead, it might copy the template to a new location, fill in placeholders, or reference the path in generated output.

Use assets/ for HTML/CSS templates, images, binary files, configuration templates, or any file that Claude manipulates by path rather than reads into context.

The key distinction between references/ and assets/ is:


  references/: Text content loaded into Claude’s context via Read tool
  assets/: Files referenced by path only, not loaded into context


This distinction matters for context management. A 10KB markdown file in references/ consumes context tokens when loaded. A 10KB HTML template in assets/ does not. Claude just knows the path exists.


  Best practice: Always use {baseDir} for paths, never hardcode absolute paths. This makes skills portable across user environments, project directories, and different installations.


Common Skill Patterns

As with everything in engineering, understanding common patterns helps in designing effective skills. Here are the most useful patterns for tool integration and workflow design.

Pattern 1: Script Automation

Use case: Complex operations requiring multiple commands or deterministic logic.

This pattern offloads computational tasks to Python or Bash scripts in the scripts/ directory. The skill prompt tells Claude to execute the script and process its output.



SKILL.md example:
Run scripts/analyzer.py on the target directory:

`python {baseDir}/scripts/analyzer.py --path &quot;$USER_PATH&quot; --output report.json`

Parse the generated `report.json` and present findings.


Required tools:
allowed-tools: &quot;Bash(python {baseDir}/scripts/*:*), Read, Write&quot;


Pattern 2: Read - Process - Write

Use case: File transformation and data processing.

The simplest pattern — read input, transform it following instructions, write output. Useful for format conversions, data cleanup, or report generation.



SKILL.md example:
## Processing Workflow
1. Read input file using Read tool
2. Parse content according to format
3. Transform data following specifications
4. Write output using Write tool
5. Report completion with summary


Required tools:
allowed-tools: &quot;Read, Write&quot;


Pattern 3: Search - Analyze - Report

Use case: Codebase analysis and pattern detection.

Search the codebase for patterns using Grep, read matching files for context, analyze findings, and generate a structured report. Or, search enterprise data store for data, analyze the retrieved data for information, and generate a structured report.



SKILL.md example:
## Analysis Process
1. Use Grep to find relevant code patterns
2. Read each matched file
3. Analyze for vulnerabilities
4. Generate structured report


Required tools:
allowed-tools: &quot;Grep, Read&quot;


Pattern 4: Command Chain Execution

Use case: Multi-step operations with dependencies.

Execute a sequence of commands where each step depends on the previous one’s success. Common for CI/CD-like workflows.



SKILL.md example:
Execute analysis pipeline:
npm install &amp;amp;&amp;amp; npm run lint &amp;amp;&amp;amp; npm test

Report results from each stage.


Required tools:
allowed-tools: &quot;Bash(npm install:*), Bash(npm run:*), Read&quot;


Advanced Patterns

Wizard-Style Multi-Step Workflows

Use case: Complex processes requiring user input at each step.

Break complex tasks into discrete steps with explicit user confirmation between each phase. Useful for setup wizards, configuration tools, or guided processes.

SKILL.md example:
## Workflow

### Step 1: Initial Setup
1. Ask user for project type
2. Validate prerequisites exist
3. Create base configuration
Wait for user confirmation before proceeding.

### Step 2: Configuration
1. Present configuration options
2. Ask user to choose settings
3. Generate config file
Wait for user confirmation before proceeding.

### Step 3: Initialization
1. Run initialization scripts
2. Verify setup successful
3. Report results


Template-Based Generation

Use case: Creating structured outputs from templates stored in assets/.

Load templates, fill placeholders with user-provided or generated data, and write the result. Common for report generation, boilerplate code creation, or documentation.

SKILL.md example:
## Generation Process
1. Read template from {baseDir}/assets/template.html
2. Parse user requirements
3. Fill template placeholders:
   -  → user-provided name
   -  → generated summary
   -  → current date
4. Write filled template to output file
5. Report completion


Iterative Refinement

Use case: Processes requiring multiple passes with increasing depth.

Perform broad analysis first, then progressively deeper dives on identified issues. Useful for code review, security audits, or quality analysis.

SKILL.md example:
## Iterative Analysis

### Pass 1: Broad Scan
1. Search entire codebase for patterns
2. Identify high-level issues
3. Categorize findings

### Pass 2: Deep Analysis
For each high-level issue:
1. Read full file context
2. Analyze root cause
3. Determine severity

### Pass 3: Recommendation
For each finding:
1. Research best practices
2. Generate specific fix
3. Estimate effort

Present final report with all findings and recommendations.


Context Aggregation

Use case: Combining information from multiple sources to build comprehensive understanding.

Gather data from different files and tools, synthesize into a coherent picture. Useful for project summaries, dependency analysis, or impact assessments.

SKILL.md example:
## Context Gathering
1. Read project README.md for overview
2. Analyze package.json for dependencies
3. Grep codebase for specific patterns
4. Check git history for recent changes
5. Synthesize findings into coherent summary


Agent Skills Internal Architecture

With the overview and building process covered, we can now examine how skills actually work under the hood. The skills system operates through a meta-tool architecture where a tool named Skill acts as a container and dispatcher for all individual skills. This design fundamentally distinguishes skills from traditional tools in both implementation and purpose.


  The Skill tool is a meta-tool that manages all skills


Skills Object Design

Traditional tools like Read, Bash, or Write execute discrete actions and return immediate results. Skills operate differently. Rather than performing actions directly, they inject specialized instructions into the conversation history and dynamically modify Claude’s execution environment. This happens through two user messages (one containing metadata visible to users, another containing the full skill prompt hidden from the UI but sent to Claude) and by altering the agent’s context to change permissions, switch models, and adjust thinking token parameters for the duration of the skill’s use.




  
    
      Feature
      Normal Tool
      Skill Tool
    
  
  
    
      Essence
      Direct action executor
      Prompt injection + context modifier
    
    
      Message Role
      assistant → tool_useuser → tool_result
      assistant → tool_use Skilluser → tool_resultuser → skill prompt ← INJECTED!
    
    
      Complexity
      Simple (3-4 messages)
      Complex (5-10+ messages)
    
    
      Context
      Static
      Dynamic (modified per turn)
    
    
      Persistence
      Tool interactions only
      Tool interactions + skill prompts
    
    
      Token Overhead
      Minimal (~100 tokens)
      Significant (~1,500+ tokens per turn)
    
    
      Use Case
      Simple, direct tasks
      Complex, guided workflows
    
  


The complexity is substantial. Normal tools generate simple message exchanges—an assistant tool call followed by a user result. Skills inject multiple messages, operate within a dynamically modified context, and carry significant token overhead to provide the specialized instructions that guide Claude’s behavior.

Understanding how the Skill meta-tool works reveals the mechanics of this system. Let’s examine its structure:

Pd = {
  name: &quot;Skill&quot;,  // The tool name constant: $N = &quot;Skill&quot;

  inputSchema: {
    command: string  // E.g., &quot;pdf&quot;, &quot;skill-creator&quot;
  },

  outputSchema: {
    success: boolean,
    commandName: string
  },

  // 🔑 KEY FIELD: This generates the skills list
  prompt: async () =&amp;gt; fN2(),

  // Validation and execution
  validateInput: async (input, context) =&amp;gt; { /* 5 error codes */ },
  checkPermissions: async (input, context) =&amp;gt; { /* allow/deny/ask */ },
  call: async *(input, context) =&amp;gt; { /* yields messages + context modifier */ }
}


The prompt field distinguishes the Skill tool from other tools like Read or Bash, which have static descriptions. Instead of a fixed string, the Skill tool uses a dynamic prompt generator that constructs its description at runtime by aggregating the names and descriptions of all available skills. This implements progressive disclosure — the system loads only the minimal metadata (skill names and descriptions from frontmatter) into Claude’s initial context, providing just enough information for the model to decide which skill matches the user’s intent. The full skill prompt loads only after Claude makes that selection, preventing context bloat while maintaining discoverability.

async function fN2() {
  let A = await atA(),
    {
      modeCommands: B,
      limitedRegularCommands: Q
    } = vN2(A),
    G = [...B, ...Q].map((W) =&amp;gt; W.userFacingName()).join(&quot;, &quot;);
  l(`Skills and commands included in Skill tool: ${G}`);
  let Z = A.length - B.length,
    Y = nS6(B),
    J = aS6(Q, Z);
  return `Execute a skill within the main conversation

&amp;lt;skills_instructions&amp;gt;
When users ask you to perform tasks, check if any of the available skills below can help complete the task more effectively. Skills provide specialized capabilities and domain knowledge.

How to use skills:
- Invoke skills using this tool with the skill name only (no arguments)
- When you invoke a skill, you will see &amp;lt;command-message&amp;gt;The &quot;{name}&quot; skill is loading&amp;lt;/command-message&amp;gt;
- The skill&apos;s prompt will expand and provide detailed instructions on how to complete the task
- Examples:
  - \`command: &quot;pdf&quot;\` - invoke the pdf skill
  - \`command: &quot;xlsx&quot;\` - invoke the xlsx skill
  - \`command: &quot;ms-office-suite:pdf&quot;\` - invoke using fully qualified name

Important:
- Only use skills listed in &amp;lt;available_skills&amp;gt; below
- Do not invoke a skill that is already running
- Do not use this tool for built-in CLI commands (like /help, /clear, etc.)
&amp;lt;/skills_instructions&amp;gt;

&amp;lt;available_skills&amp;gt;
${Y}${J}
&amp;lt;/available_skills&amp;gt;
`;
}


Unlike some assistants such as ChatGPT, where default tools live in the system prompt, Claude agent skills do not live in the system prompt. They live in the tools array as part of the Skill tool’s description. The names of the individual skills are represented in the Skill meta-tool’s input schema as the command field. To better visualize how this looks, here’s the actual API request structure:

{
  &quot;model&quot;: &quot;claude-sonnet-4-5-20250929&quot;,
  &quot;system&quot;: &quot;You are Claude Code, Anthropic&apos;s official CLI...&quot;,  // ← System prompt
  &quot;messages&quot;: [
    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Help me create a new skill&quot;},
    // ... conversation history
  ],
  &quot;tools&quot;: [  // ← Tools array sent to Claude
    {
      &quot;name&quot;: &quot;Skill&quot;,  // ← The meta-tool
      &quot;description&quot;: &quot;Execute a skill...\n\n&amp;lt;skills_instructions&amp;gt;...\n\n&amp;lt;available_skills&amp;gt;\n...&quot;,
      &quot;input_schema&quot;: {
        &quot;type&quot;: &quot;object&quot;,
        &quot;properties&quot;: {
          &quot;command&quot;: {
            &quot;type&quot;: &quot;string&quot;,
            &quot;description&quot;: &quot;The skill name (no arguments)&quot;  // ← Name of individual skill
          }
        }
      }
    },
    {
      &quot;name&quot;: &quot;Bash&quot;,
      &quot;description&quot;: &quot;Execute bash commands...&quot;,
      // ...
    },
    {
      &quot;name&quot;: &quot;Read&quot;,
      // ...
    }
    // ... other tools
  ]
}


The &amp;lt;available_skills&amp;gt; section lives within the Skill tool’s description and gets regenerated for each API request. The system dynamically builds this list by aggregating currently loaded skills from user and project configurations, plugin-provided skills, and any built-in skills, subject to a token budget limit of 15,000 characters by default. This budget constraint forces skill authors to write concise descriptions and ensures the tool description doesn’t overwhelm the model’s context window.
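
A rough sketch of what that budget-constrained aggregation might look like is below. The helper name and the simple character-count cutoff are assumptions for illustration, not the actual Claude Code logic.

# Illustrative sketch of building the &amp;lt;available_skills&amp;gt; list under a character budget.
# The function name and cutoff strategy are hypothetical.
def format_available_skills(skills, budget_chars=15000):
    lines, used = [], 0
    for skill in skills:
        entry = f&apos;&quot;{skill[&quot;name&quot;]}&quot;: {skill[&quot;description&quot;]}&apos;
        if used + len(entry) &amp;gt; budget_chars:
            break                      # stop once the budget is exhausted
        lines.append(entry)
        used += len(entry)
    return &quot;\n&quot;.join(lines)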

Skill Conversation and Execution Context Injection Design

Most LLM APIs support role: &quot;system&quot; messages that could theoretically carry system prompts. In fact, OpenAI’s ChatGPT carries its default tools in its system prompt, including bio for memory, automations for task scheduling, canmore for controlling canvas, img_gen for image generation, file_search, python, and web for Internet search. All told, the tools prompt takes up around 90% of the token count in its system prompt. This works, but it is hardly efficient if we have lots of tools and/or skills to load into the context.

However, system messages have different semantics that make them unsuitable for skills. System messages set global context that persists across the entire conversation, affecting all subsequent turns with higher authority than user instructions.

Skills need temporary, scoped behavior. The pdf skill should only affect the current PDF task, not transform Claude into a permanent PDF specialist for the rest of the session. Using role: &quot;user&quot; with isMeta: true makes the skill prompt appear as user input to Claude, keeping it temporary and localized to the current interaction. After the skill completes, the conversation returns to its normal conversation context and execution context without residual behavioral modifications.

Normal tools like Read, Write, or Bash have simple communication patterns. When Claude invokes Read, it sends a file path, receives the file contents, and continues working. The user sees “Claude used the Read tool” in their transcript, and that’s sufficient transparency. The tool did one thing, returned a result, and that’s the end of the interaction.

Skills operate fundamentally differently. Instead of executing discrete actions and returning results, skills inject comprehensive instruction sets that modify how Claude reasons about and approaches the task. This creates a design challenge that normal tools never face: users need transparency about which skills are running and what they’re doing, while Claude needs detailed, potentially verbose instructions to execute the skill correctly. If users see the full skill prompts in their chat transcript, the UI becomes cluttered with thousands of words of internal AI instructions. If the skill activation is completely hidden, users lose visibility into what the system is doing on their behalf. The solution requires separating these two communication channels into distinct messages with different visibility rules.

The skills system uses an isMeta flag on each message to control whether it appears in the user interface. When isMeta: false (or when the flag is omitted and defaults to false), the message renders in the conversation transcript that users see. When isMeta: true, the message gets sent to the Anthropic API as part of Claude’s conversation context but never appears in the UI. This simple boolean flag enables sophisticated dual-channel communication: one stream for human users, another for the AI model. Meta-prompting for meta-tools!

When a skill executes, the system injects two separate user messages into the conversation history. The first carries skill metadata with isMeta: false, making it visible to users as a status indicator. The second carries the full skill prompt with isMeta: true, hiding it from the UI while making it available to Claude. This split solves the transparency vs clarity tradeoff by showing users what’s happening without overwhelming them with implementation details.

The metadata message uses a concise XML structure that the frontend can parse and display appropriately:

let metadata = [
  `&amp;lt;command-message&amp;gt;${statusMessage}&amp;lt;/command-message&amp;gt;`,
  `&amp;lt;command-name&amp;gt;${skillName}&amp;lt;/command-name&amp;gt;`,
  args ? `&amp;lt;command-args&amp;gt;${args}&amp;lt;/command-args&amp;gt;` : null
].filter(Boolean).join(&apos;\n&apos;);

// Message 1: NO isMeta flag → defaults to false → VISIBLE
messages.push({
  content: metadata,
  autocheckpoint: checkpointFlag
});


When the PDF skill activates, for example, users see a clean loading indicator in their transcript:
&amp;lt;command-message&amp;gt;The &quot;pdf&quot; skill is loading&amp;lt;/command-message&amp;gt;
&amp;lt;command-name&amp;gt;pdf&amp;lt;/command-name&amp;gt;
&amp;lt;command-args&amp;gt;report.pdf&amp;lt;/command-args&amp;gt;


This message stays intentionally minimal - typically 50 to 200 characters. The XML tags enable the frontend to render it with special formatting, validate that proper &amp;lt;command-message&amp;gt; tags are present, and maintain an audit trail of which skills executed during the session. Because the isMeta flag defaults to false when omitted, this metadata automatically appears in the UI.

The skill prompt message takes the opposite approach. It loads the full content from SKILL.md, potentially augments it with additional context, and explicitly sets isMeta: true to hide it from users:

let skillPrompt = await skill.getPromptForCommand(args, context);

// Augment with prepend/append content if needed
let fullPrompt = prependContent.length &amp;gt; 0 || appendContent.length &amp;gt; 0
  ? [...prependContent, ...appendContent, ...skillPrompt]
  : skillPrompt;

// Message 2: Explicit isMeta: true → HIDDEN
messages.push({
  content: fullPrompt,
  isMeta: true  // HIDDEN FROM UI, SENT TO API
});


A typical skill prompt runs 500 to 5,000 words and provides comprehensive guidance to transform Claude’s behavior. The PDF skill prompt might contain:

You are a PDF processing specialist.

Your task is to extract text from PDF documents using the pdftotext tool.

## Process

1. Validate the PDF file exists
2. Run pdftotext command to extract text
3. Read the output file
4. Present the extracted text to the user

## Tools Available

You have access to:
- Bash(pdftotext:*) - For running pdftotext command
- Read - For reading extracted text
- Write - For saving results if needed

## Output Format

Present the extracted text clearly formatted.

Base directory: /path/to/skill
User arguments: report.pdf


This prompt establishes task context, outlines the workflow, specifies available tools, defines output format, and provides environment-specific paths. The markdown structure with headers, lists, and code blocks helps Claude parse and follow the instructions. With isMeta: true, this entire prompt gets sent to the API but never clutters the user’s transcript.

Beyond the core metadata and skill prompt, skills can inject additional conditional messages for attachments and permissions:

let allMessages = [
  createMessage({ content: metadata, autocheckpoint: flag }),  // 1. Metadata
  createMessage({ content: skillPrompt, isMeta: true }),       // 2. Skill prompt
  ...attachmentMessages,                                       // 3. Attachments (conditional)
  ...(allowedTools.length || skill.model ? [
    createPermissionsMessage({                                 // 4. Permissions (conditional)
      type: &quot;command_permissions&quot;,
      allowedTools: allowedTools,
      model: skill.useSmallFastModel ? getFastModel() : skill.model
    })
  ] : [])
];


Attachment messages can carry diagnostic information, file references, or additional context that supplements the skill prompt. Permission messages only appear when the skill specifies allowed-tools in its frontmatter or requests a model override, providing metadata that modifies the runtime execution environment. This modular composition allows each message to have a specific purpose and be included or excluded based on the skill’s configuration, extending the basic two-message pattern to handle more complex scenarios while maintaining the same visibility control through isMeta flags.

Why Two Messages Instead of One?

A single-message design would force an impossible choice. Setting isMeta: false would make the entire message visible, dumping thousands of words of AI instructions into the user’s chat transcript. Users would see something like:

┌─────────────────────────────────────────────┐
│ The &quot;pdf&quot; skill is loading                  │
│                                             │
│ You are a PDF processing specialist.        │
│                                             │
│ Your task is to extract text from PDF       │
│ documents using the pdftotext tool.         │
│                                             │
│ ## Process                                  │
│                                             │
│ 1. Validate the PDF file exists             │
│ 2. Run pdftotext command to extract text    │
│ 3. Read the output file                     │
│ ... [500 more lines] ...                    │
└─────────────────────────────────────────────┘


The UI becomes unusable, filled with internal implementation details meant for Claude, not humans. Alternatively, setting isMeta: true would hide everything, providing no transparency about which skill activated or what arguments it received. Users would have no visibility into what the system is doing on their behalf.

The two-message split resolves this by giving each message a different isMeta value. Message 1 with isMeta: false provides user-facing transparency. Message 2 with isMeta: true provides Claude with detailed instructions. This granular control enables transparency without information overload.

The messages also serve fundamentally different audiences and purposes:


  
    
| Aspect | Metadata Message | Skill Prompt Message |
| --- | --- | --- |
| Audience | Human user | Claude (AI) |
| Purpose | Status/transparency | Instructions/guidance |
| Length | ~50-200 chars | ~500-5,000 words |
| Format | Structured XML | Natural language markdown |
| Visibility | Should be visible | Should be hidden |
| Content | “What is happening?” | “How to do it?” |
  


The codebase even processes these messages through different paths. The metadata message gets parsed for &amp;lt;command-message&amp;gt; tags, validated, and formatted for UI display. The skill prompt message gets sent directly to the API without parsing or validation—it’s raw instructional content meant only for Claude’s reasoning process. Combining them would violate the Single Responsibility Principle by forcing one message to serve two distinct audiences through two different processing pipelines.
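
To make that split concrete, here is a hypothetical sketch of the UI-side processing path. The function name and validation details are illustrative assumptions, not the actual Claude Code implementation:

// Hypothetical sketch of the UI-side path - the real parsing and validation code may differ.
function parseSkillMetadata(content) {
  const extract = (tag) =&amp;gt; {
    const match = content.match(new RegExp(`&amp;lt;${tag}&amp;gt;([\\s\\S]*?)&amp;lt;/${tag}&amp;gt;`));
    return match ? match[1] : null;
  };

  const statusMessage = extract(&apos;command-message&apos;);
  if (!statusMessage) {
    throw new Error(&apos;Metadata message is missing &amp;lt;command-message&amp;gt; tags&apos;);
  }

  return {
    statusMessage,                        // rendered as the loading indicator
    skillName: extract(&apos;command-name&apos;),  // recorded for the audit trail
    args: extract(&apos;command-args&apos;)        // optional arguments, e.g. report.pdf
  };
}
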

Case Study: Execution Lifecycle

Having covered the Agent Skills internal architecture, let’s walk through what happens when a user says “Extract text from report.pdf” by examining the complete execution flow, using a hypothetical pdf skill as a case study.



Phase 1: Discovery &amp;amp; Loading (Startup)

When Claude Code starts, it scans for skills:

async function getAllCommands() {
  // Load from all sources in parallel
  let [userCommands, skillsAndPlugins, pluginCommands, builtins] =
    await Promise.all([
      loadUserCommands(),      // ~/.claude/commands/
      loadSkills(),            // .claude/skills/ + plugins
      loadPluginCommands(),    // Plugin-defined commands
      getBuiltinCommands()     // Hardcoded commands
    ]);

  return [...userCommands, ...skillsAndPlugins, ...pluginCommands, ...builtins]
    .filter(cmd =&amp;gt; cmd.isEnabled());
}

// Specific skill loading
async function loadPluginSkills(plugin) {
  // Check if plugin has skills
  if (!plugin.skillsPath) return [];

  // Two patterns supported:
  // 1. Root SKILL.md in skillsPath
  // 2. Subdirectories with SKILL.md

  const skillFiles = findSkillMdFiles(plugin.skillsPath);
  const skills = [];

  for (const file of skillFiles) {
    const content = readFile(file);
    const { frontmatter, markdown } = parseFrontmatter(content);

    skills.push({
      type: &quot;prompt&quot;,
      name: `${plugin.name}:${getSkillName(file)}`,
      description: `${frontmatter.description} (plugin:${plugin.name})`,
      whenToUse: frontmatter.when_to_use,  // ← Note: underscores!
      allowedTools: parseTools(frontmatter[&apos;allowed-tools&apos;]),
      model: frontmatter.model === &quot;inherit&quot; ? undefined : frontmatter.model,
      isSkill: true,
      promptContent: markdown,
      // ... other fields
    });
  }

  return skills;
}


For the pdf skill, this produces:
{
  type: &quot;prompt&quot;,
  name: &quot;pdf&quot;,
  description: &quot;Extract text from PDF documents (plugin:document-tools)&quot;,
  whenToUse: &quot;When user wants to extract or process text from PDF files&quot;,
  allowedTools: [&quot;Bash(pdftotext:*)&quot;, &quot;Read&quot;, &quot;Write&quot;],
  model: undefined,  // Uses session model
  isSkill: true,
  disableModelInvocation: false,
  promptContent: &quot;You are a PDF processing specialist...&quot;,
  // ... other fields
}


Phase 2: Turn 1 - User Request &amp;amp; Skill Selection

The user sends a request: “Extract text from report.pdf”. Claude receives this message along with the Skill tool in its tools array. Before Claude can decide to invoke the pdf skill, the system must present available skills in the Skill tool’s description.

Skill Filtering &amp;amp; Presentation

Not all loaded skills appear in the Skill tool. A skill MUST have either description OR when_to_use in frontmatter, or it’s filtered out. Filtering criteria:

async function getSkillsForSkillTool() {
  const allCommands = await getAllCommands();

  return allCommands.filter(cmd =&amp;gt;
    cmd.type === &quot;prompt&quot; &amp;amp;&amp;amp;
    cmd.isSkill === true &amp;amp;&amp;amp;
    !cmd.disableModelInvocation &amp;amp;&amp;amp;
    (cmd.source !== &quot;builtin&quot; || cmd.isModeCommand === true) &amp;amp;&amp;amp;
    (cmd.hasUserSpecifiedDescription || cmd.whenToUse)  // ← Must have one!
  );
}

Skill Formatting

Each skill is formatted for the &amp;lt;available_skills&amp;gt; section. As an example, our hypothetical pdf skill could be formatted into:
&quot;pdf&quot;: Extract text from PDF documents - When user wants to extract or process text from PDF files

function formatSkill(skill) {
  let name = skill.name;
  let description = skill.whenToUse
    ? `${skill.description} - ${skill.whenToUse}`
    : skill.description;

  return `&quot;${name}&quot;: ${description}`;
}


Claude’s Decision Process

Now, when the user prompts “Extract text from report.pdf”, Claude receives the API request with the Skill tool, reads the &amp;lt;available_skills&amp;gt;, and reasons (hypothetically, as we do not see its reasoning traces):

Internal reasoning:
- User wants to &quot;extract text from report.pdf&quot;
- This is a PDF processing task
- Looking at available skills...
- &quot;pdf&quot;: Extract text from PDF documents - When user wants to extract or process text from PDF files
- This matches! The user wants to extract text from a PDF
- Decision: Invoke Skill tool with command=&quot;pdf&quot;


Note that there’s no algorithmic matching here. No lexical matching. No semantic matching. No searches. The decision is pure LLM reasoning over the skill descriptions. Once decided, Claude returns a tool use:

{
  &quot;type&quot;: &quot;tool_use&quot;,
  &quot;id&quot;: &quot;toolu_123abc&quot;,
  &quot;name&quot;: &quot;Skill&quot;,
  &quot;input&quot;: {
    &quot;command&quot;: &quot;pdf&quot;
  }
}


Phase 3: Skill Tool Execution

The Skill tool now executes. This corresponds to the yellow “SKILL TOOL EXECUTION” box in the sequence diagram, which performs validation, permission checks, file loading, and context modification before yielding the result.

Step 1: Validation

async validateInput({ command }, context) {
  let skillName = command.trim().replace(/^\//, &quot;&quot;);

  // Error 1: Empty
  if (!skillName) return { result: false, errorCode: 1 };

  // Error 2: Unknown skill
  const allSkills = await getAllCommands();
  if (!skillExists(skillName, allSkills)) {
    return { result: false, errorCode: 2 };
  }

  // Error 3: Can&apos;t load
  const skill = getSkill(skillName, allSkills);
  if (!skill) return { result: false, errorCode: 3 };

  // Error 4: Model invocation disabled
  if (skill.disableModelInvocation) {
    return { result: false, errorCode: 4 };
  }

  // Error 5: Not prompt-based
  if (skill.type !== &quot;prompt&quot;) {
    return { result: false, errorCode: 5 };
  }

  return { result: true };
}


The pdf skill passes all validation checks ✓

Step 2: Permission Check

async checkPermissions({ command }, context) {
  const skillName = command.trim().replace(/^\//, &quot;&quot;);
  const permContext = (await context.getAppState()).toolPermissionContext;

  // Check deny rules
  for (const [pattern, rule] of getDenyRules(permContext)) {
    if (matches(skillName, pattern)) {
      return { behavior: &quot;deny&quot;, message: &quot;Blocked by permission rules&quot; };
    }
  }

  // Check allow rules
  for (const [pattern, rule] of getAllowRules(permContext)) {
    if (matches(skillName, pattern)) {
      return { behavior: &quot;allow&quot; };
    }
  }

  // Default: ask user
  return { behavior: &quot;ask&quot;, message: `Execute skill: ${skillName}` };
}


Assuming no rules, user is prompted: “Execute skill: pdf?”
User approves ✓

Step 3: Load Skill File and Generate Execution Context Modification

With validation and permissions approved, the Skill tool loads the skill file and prepares the execution context modification:

async *call({ command }, context) {
  const skillName = command.trim().replace(/^\//, &quot;&quot;);
  const allSkills = await getAllCommands();
  const skill = getSkill(skillName, allSkills);

  // Load the skill prompt
  const promptContent = await skill.getPromptForCommand(&quot;&quot;, context);

  // Generate metadata tags
  const metadata = [
    `&amp;lt;command-message&amp;gt;The &quot;${skill.userFacingName()}&quot; skill is loading&amp;lt;/command-message&amp;gt;`,
    `&amp;lt;command-name&amp;gt;${skill.userFacingName()}&amp;lt;/command-name&amp;gt;`
  ].join(&apos;\n&apos;);

  // Create messages
  const messages = [
    { type: &quot;user&quot;, content: metadata },  // Visible to user
    { type: &quot;user&quot;, content: promptContent, isMeta: true },  // Hidden from user, visible to Claude
    // ... attachments, permissions
  ];

  // Extract configuration
  const allowedTools = skill.allowedTools || [];
  const modelOverride = skill.model;

  // Yield result with execution context modifier
  yield {
    type: &quot;result&quot;,
    data: { success: true, commandName: skillName },
    newMessages: messages,

    // 🔑 Execution context modification function
    contextModifier(context) {
      let modified = context;

      // Inject allowed tools
      if (allowedTools.length &amp;gt; 0) {
        modified = {
          ...modified,
          async getAppState() {
            const state = await context.getAppState();
            return {
              ...state,
              toolPermissionContext: {
                ...state.toolPermissionContext,
                alwaysAllowRules: {
                  ...state.toolPermissionContext.alwaysAllowRules,
                  command: [
                    ...state.toolPermissionContext.alwaysAllowRules.command || [],
                    ...allowedTools  // ← Pre-approve these tools
                  ]
                }
              }
            };
          }
        };
      }

      // Override model
      if (modelOverride) {
        modified = {
          ...modified,
          options: {
            ...modified.options,
            mainLoopModel: modelOverride
          }
        };
      }

      return modified;
    }
  };
}


The Skill tool yields its result containing newMessages (metadata + skill prompt + permissions for conversation context injection) and contextModifier (tool permissions + model override for execution context modification). This completes the yellow “SKILL TOOL EXECUTION” box from the sequence diagram.

Phase 4: Send to API (Turn 1 Completion)

The system constructs the complete messages array to send to the Anthropic API. This includes all messages from the conversation plus the newly injected skill messages:

// Complete message array sent to API for Turn 1
{
  model: &quot;claude-sonnet-4-5-20250929&quot;,
  messages: [
    {
      role: &quot;user&quot;,
      content: &quot;Extract text from report.pdf&quot;
    },
    {
      role: &quot;assistant&quot;,
      content: [
        {
          type: &quot;tool_use&quot;,
          id: &quot;toolu_123abc&quot;,
          name: &quot;Skill&quot;,
          input: { command: &quot;pdf&quot; }
        }
      ]
    },
    {
      role: &quot;user&quot;,
      content: &quot;&amp;lt;command-message&amp;gt;The \&quot;pdf\&quot; skill is loading&amp;lt;/command-message&amp;gt;\n&amp;lt;command-name&amp;gt;pdf&amp;lt;/command-name&amp;gt;&quot;
      // isMeta: false (default) - VISIBLE to user in UI
    },
    {
      role: &quot;user&quot;,
      content: &quot;You are a PDF processing specialist...\n\n## Process\n1. Validate PDF exists\n2. Run pdftotext...&quot;,
      isMeta: true  // HIDDEN from UI, sent to API
    },
    {
      role: &quot;user&quot;,
      content: {
        type: &quot;command_permissions&quot;,
        allowedTools: [&quot;Bash(pdftotext:*)&quot;, &quot;Read&quot;, &quot;Write&quot;],
        model: undefined
      }
    }
  ]
}




The above shows what we have done up to this point. The execution context modifier is applied, pre-approving Bash(pdftotext:*), Read, and Write for subsequent tool calls. The request is sent to the Anthropic API. This concludes the skill execution. If this were a normal tool, we would be done. However, skills are different. The Agent Skill has only injected the conversation context and modified the execution context. This means we still need to invoke the Claude agent with all of these injected contexts to complete the user’s request!

Phase 5: Bash Tool Execution (Tool use with Skill Context!)

Claude receives the API response with the injected conversation context. The skill prompt has transformed Claude’s behavior, providing it with:

  Specialized PDF processing instructions (conversation context)
  Pre-approved access to Bash(pdftotext:*), Read, and Write tools (execution context)
  A clear workflow to follow (conversation context)


Claude processes the context and follows the pdf skill’s workflow:

I&apos;ll extract text from report.pdf. Let me process the file.

[Following pdf skill&apos;s instructions]
1. Validate that report.pdf exists
2. Run pdftotext command to extract text
3. Read the output file
4. Present the extracted text to you


Claude uses the Bash tool (pre-approved, no user prompt required):

{
  &quot;type&quot;: &quot;tool_use&quot;,
  &quot;id&quot;: &quot;toolu_456def&quot;,
  &quot;name&quot;: &quot;Bash&quot;,
  &quot;input&quot;: {
    &quot;command&quot;: &quot;pdftotext report.pdf output.txt&quot;,
    &quot;description&quot;: &quot;Extract text from PDF using pdftotext&quot;
  }
}


The Bash tool executes successfully, returning the result. Claude then reads the output file using the Read tool and presents the extracted text to the user. The skill has successfully guided Claude through the specialized PDF extraction workflow by injecting instructions into the conversation context and modifying the execution context for tool permissions.



Conclusion: The Mental Model Recap

Skills in Claude Code are prompt-based conversation and execution context modifiers that work through a meta-tool architecture:

Key Takeaways:


  Skills are prompt templates in SKILL.md files, not executable code
  The Skill tool (capital S) is a meta-tool that lives in the tools array, not in the system prompt, and manages individual skills
  Skills modify conversation context by injecting instruction prompts (via isMeta: true messages)
  Skills modify execution context by changing tool permissions and model selection
  Selection happens via LLM reasoning, not algorithmic matching
  Tool permissions are scoped to skill execution via execution context modification
  Skills inject two user messages per invocation—one for user-visible metadata, one for hidden instructions sent to the API


The Elegant Design: By treating specialized knowledge as prompts that modify conversation context and permissions that modify execution context rather than code that executes, Claude Code achieves flexibility, safety, and composability that would be difficult with traditional function calling.



References


  Introducing Agent Skills
  Equipping Agents for the Real World with Agent Skills
  Claude Code Documentation
  Anthropic API Reference
  Official Documented Frontmatter Fields
  Internal Comms Skill
  Skill Creator Skill
  ChatGPT 5 System Prompt (leaked, not official)


@article{
    leehanchung_claude_skills,
    author = {Lee, Hanchung},
    title = {Claude Agent Skills: A First Principles Deep Dive},
    year = {2025},
    month = {10},
    day = {26},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/}
}



        </description>

        <pubDate>Sun, 26 Oct 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/</guid>

      </item>

    

      <item>

        <title>Enterprise AI Transformation: The 4-Set Framework for IT 3.0</title>

        <description>

          From SaaS glue work to AI-native systems: mindset, toolset, skillset, data - 

          If you’re leading enterprise transformation, you need to understand what’s happening to your workforce, your tech stack, and your competitors right now. Three waves of enterprise IT have systematically created and destroyed entire job categories. The third wave is accelerating, and it will reshape your organization whether you’re ready or not.


  IT 1.0 built careers around custom in-house software. IT departments were strategic assets staffed with programmers and system operators who created bespoke systems mapping human processes to software workflows.
  IT 2.0 outsourced that intelligence to hundreds of SaaS solutions, creating a fragmented modular stack that spawned an entire ecosystem of “glue work” roles – SaaS admins, implementation engineers, system integrators, data analysts reconciling mismatched systems. Anthropologist David Graeber called these “bullshit jobs”: work that exists mainly because systems can’t talk to each other.
  IT 3.0 is dissolving this glue layer with AI-native systems and agents that can draft, coordinate, and produce outcomes without predefined workflows. The bullshit jobs of IT 2.0 are first on the chopping block. Just as CAD tools erased armies of draftsmen, or UBS’s trading automation emptied a stadium-sized trading floor in Stamford in 2012, today’s AI agents are already hollowing out sales ops, recruiting coordinators, junior devs, and SaaS admins.


Your strategic decisions today determine whether you’re building the next wave of infrastructure or clinging to a dying paradigm. Here’s what you need to know.

From IT 1.0 to IT 3.0

IT 1.0 – Building and Owning the Stack

Before the Internet and cloud computing, enterprises staffed full IT departments and procured from software vendors like IBM, Oracle, and SAP to build and run in-house systems. These were expensive and specialized, but tightly integrated with the business. This software translated existing human processes into workflows defined in bespoke software living on top of a database.

IT departments were core strategic assets, staffed with programmers, system operators, and managers who created bespoke systems. Jobs in this era were directly tied to creating or maintaining foundational business infrastructure. This period gave us foundational software development methodologies like Agile, born from Chrysler’s internal payroll system project. The work was complex and expensive, but it was essential.

IT 2.0 – SaaS and the Bullshit Job Boom

The 2000s marked the era of SaaS, with software now delivered over the Internet. Companies shifted from building internally to subscribing. This democratized access to powerful tools like cloud-based CRMs, HR suites, and ERPs. This enabled consumption-based pricing and product-led growth business models. At the same time, this created a new systemic problem: a fragmented modular stack of hundreds of applications that are siloed. Data silos and operational silos emerged everywhere, with Excel files pushed via email to glue everything together.

This fragmentation gave birth to a massive ecosystem of what David Graeber termed “bullshit jobs” – roles that exist primarily to service the friction between systems. These aren’t jobs that create direct value; they exist to pay the “information organization tax.” Duct tapers patch half-working systems together with shoddy code or send Excel files via email. Box tickers send Excel sheets to half a dozen folks for check-off, creating the appearance that something productive is being done when it is not.

This boom in “glue work” included:

  System Administrators: Entire careers built around configuring, managing, and patching platforms like Salesforce or Workday.
  System Integrators: Specialists whose job was to connect one SaaS tool to another and migrate data between them.
  Data Entry &amp;amp; Junior Analysts: Armies of people hired to manually move data from spreadsheets and PDFs into rigid SaaS formats.
  Operations Roles (Sales Ops, HR Ops, Rev Ops): Professionals who spend their days coordinating approvals, managing handoffs, and bridging gaps that APIs and integrations never quite solved.


Graeber’s notion of bullshit jobs became reality: people spending careers moving data from one rectangle on their monitor to another. This wasn’t work in the economic sense of creating value – it was a side effect of SaaS modularity and weak interoperability.

The information organization tax was massive. Meetings, cross-departmental handoffs, redundant reporting – all to coordinate intelligence scattered across silos.

This era also saw the rise of IT Consulting and Outsourcing, with inherent misaligned incentives. Shoddy software was developed to maximize overall contract value, including system integration, data migration, and ongoing maintenance contract renewals. The now-hollowed-out IT departments were no longer technical enough to ensure quality of work. Boeing’s outsourcing of its software design through layers of sub-contracting was the biggest showcase of this failure.

This entire industry — admins, ops, consultants — was built on a foundation of systems that couldn’t talk to each other. AI is now removing that foundation.

IT 3.0 – AI Dissolves the Glue

The opportunity: Organizations that transform first will operate with half the headcount and twice the velocity of their competitors.

The AI-native wave is fundamentally different. Instead of creating more silos, AI agents and copilots are dissolving the “glue” that holds the fragmented IT 2.0 stack together. These systems can draft workflows, translate data between formats, and execute complex processes across multiple tools without human intervention or mediation.

Just as CAD tools turned hundreds of draftsmen into a handful of designers with software, or as UBS shuttered its massive Stamford, Connecticut trading floor in 2012 after algorithmic trading made human traders redundant, AI is now dismantling the SaaS-created glue layer. All the workflow definitions and playbooks are becoming obsolete and will be codified within the model and tools.




  IT 3.0 software now requires environments and infrastructure for agents to operate, instead of translating human-centric processes from the 1900s into workflows – the era of agent-computer interfaces.


These roles are being eliminated now, not in some distant future:


  SaaS admins → AI copilots auto-generate workflows, reports, and integrations. If you’re still hiring Salesforce admins, you’re building the wrong team.
  Recruiting coordinators → Chatbots already schedule interviews and screen resumes. This role has maybe 18 months left.
  Entry-level developers → Code assistants handle glue code and CRUD apps. Your hiring funnel should reflect this.
  Sales ops &amp;amp; BDRs → AI personalizes outreach and processes leads at volume. Manual outreach doesn’t scale anymore.
  Finance &amp;amp; HR ops → AI reconciles invoices, updates HR records, and generates compliance docs. Every manual handoff is a liability.


The changes are accelerating because the barriers to adoption are collapsing. IT 1.0 required massive capital expenditure and months of implementation. IT 2.0 reduced cost but still required lengthy training and change management. IT 3.0 collapses time to value: it takes seconds to issue commands to ChatGPT instead of months training super users on PeopleSoft for administrators or Epic for healthcare. When adoption barriers disappear, displacement accelerates.

Which IT 2.0 Tools Are Most at Risk?

AI will hit hardest where SaaS tools created clerical overhead:


  CRM (Salesforce, HubSpot) – lead enrichment, pipeline updates, report generation: all ripe for AI automation.
  ATS &amp;amp; HR platforms (Workday, Greenhouse) – resume parsing, candidate scheduling, payroll entry: trivial for AI.
  Customer support platforms (Zendesk, ServiceNow) – tier-1 support is already being offloaded to LLM agents.
  Project management (Asana, Jira, Monday) – task creation, updates, and cross-tool syncing will be handled by AI copilots, reducing the need for ops roles.
  Finance/ERP (NetSuite, SAP) – invoice matching, expense categorization, forecasting: automatable.


The SaaS platforms may survive, but the job ecosystems around them will not.

AI Transformations

One of the biggest challenges in AI transformation is how we measure. Poorly defined metrics will incentivize the wrong behavior.

Take engineering as an example:

  Lines of code written by AI
  Weekly active users on Cursor
  Percentage of PRs reviewed by AI


These metrics measure only adoption and utilization of AI tools and incentivize activity. They capture activities and outputs, not impact and outcomes. People are equipped with AI tools but operate the same way they did before ChatGPT. Cargo-cult AI transformation.

What actually matters and should be tracked are:

  Product lead time: from PM idea inception to production.
  Ticket resolution time: Time to resolve request and support tickets
  Change fail percentage: how often your deployments blow up and require hotfixes.


These aren’t novel ideas; they’re adapted from DORA metrics. But implementing them requires serious platform investment in analytics and observability.

And if we’re being realistic, most engineering hours aren’t spent building features. They’re occupied by operational overhead and enterprise architecture complexity, with interdependencies between services, org silos, and coordination tax. The microservices dream turned into a distributed monolith nightmare.



This problem isn’t unique to engineering. Every function needs to figure out what velocity means for them, and the answers are completely different. Legal measures contracts processed per month, not contracts reviewed—velocity over volume in progress. Marketing tracks campaign velocity, concept to launch—how fast can you test and iterate? Sales optimization is deal cycle time, qualified lead to close, not pipeline size. Finance cares about close cycle time and days to financial insights. Same transformation, completely different metrics.

This is why centralizing AI transformation is very hard to do well. Many companies are setting up Chief AI Officers and AI Enablement Engineering Teams to “manage” the IT 3.0 shift. This creates exactly the wrong dynamic. One overwhelmed function while everyone else waits for direction and navigates bureaucracy. You end up with coordination overhead on top of your existing coordination overhead.

For companies actually making the IT 3.0 transition work, every executive owns AI transformation for their domain. Legal, finance, marketing, sales, engineering. Each has dedicated teams and executive accountability to transform their function from within. This is a business transformation, not IT transformation. The Chief AI Officer org chart is IT 2.0 thinking applied to an IT 3.0 problem.

Measuring AI Transformation with 4-Sets Framework

To measure AI Transformation, we can leverage the 4-Sets Framework used in the early days of Big Data - mindset, toolset, skillset, and dataset. This helps us analyze where an organization stands on each dimension.

Mindset: Getting Comfortable with Probabilistic Systems

AI is fundamentally different from deterministic software. Same input, different outputs. “Correct” is contextual, not binary. This breaks every QA process and approval workflow designed for deterministic systems. When AI works 95% of the time, quality control becomes exponentially harder. Most organizations can’t get comfortable with “95% accurate” instead of “100% correct.”

The hardest concept for product teams to grasp is that AI makes shiny demos trivially easy and production deployment brutally hard. Compelling demos get built in days. Production at scale—with acceptable error rates, latency, and cost—takes months. The gap between “it works in the demo” and “it works at 10 requests per second” is where most AI projects die. OpenAI built their Agent Builder in 6 weeks with primitive user experiences. This made investors realize that n8n is the actual category leader, allowing it to raise $180M at a $2.5B valuation.

This requires a cultural shift most companies aren’t prepared for. You have to allow failures to happen. Innovation requires experimentation. Experimentation requires accepting failure. If your culture punishes failed AI experiments the same way it punishes product failures or production outages, nobody will innovate. There will be theater. Cargo-cult AI adoption where engineers use Cursor exactly like they use VSCode. Weekly active users up, changes in productivity flat. All form, no function. The shift from “zero defects” culture to “fast iteration” culture is the hardest change to make, and it’s the one that determines whether transformation succeeds or becomes expensive theater.

Toolset: Give Your People AI, Then Get Out of the Way

The toolset question isn’t “what should we build?” It’s “what do we give employees so they can experiment?” Enable the early adopters. Most organizations approach this backwards. They lock down AI access while forming committees to “evaluate use cases.” By the time the committee finishes, your competitors have six months of experimentation learning ahead of you. GPT-3.5 has become GPT-4, and you just wasted a full generation of AI progress. Just look at the adoption curve and it’s not hard to realize that one day in AI is seven days in software. The progress of AI research and engineering is the hyperbolic time chamber in Dragon Ball.



Start with access. Enterprise licenses for Claude, ChatGPT, coding assistants like Cursor or Windsurf. The ROI comes from letting your team discover what works instead of trying to predict it from a conference room.

Remove approval bureaucracy for internal tools. If an engineer wants to try auto-generating test cases, or marketing wants to experiment with campaign copy variations, they shouldn’t need VP sign-off. Create guardrails — what data can’t leave the organization, what decisions need human review — then let teams iterate within those boundaries. The organizations winning at this have clear rules and fast iteration, not slow approvals and perfect safety.

The infrastructure and platform shift is fundamental. You need environments where AI can actually operate — API access to internal systems, data pipelines that AI can query, workflows that can be triggered programmatically. If your systems only work through web UIs and manual clicking, AI can’t help much. This doesn’t mean rebuilding everything overnight, but it does mean every new system should be designed for programmatic access first, human interfaces second. Build for agent-computer interfaces.

Build safe sandboxes. The real blocker for AI is that nobody can access production data to try things. Create environments with representative data where people can experiment without going through 20 levels of approval or risking compliance violations. Sanitized customer data, recent transaction samples, realistic test cases. Make it real enough to be useful, safe enough to be accessible.

Skillset: Developing AI Literacy in Your Existing Workforce

The skillset question is challenging. It’s not who should we hire, but how do we develop our existing people. There’s institutional knowledge and domain expertise sitting in the current workforce. Replacing them with AI-native hires means throwing away context that took years to build.

Start with AI literacy training, but make it practical. Nobody needs another 30-minute video on “what is generative AI.” They need hands-on practice using AI tools for their actual work. Give your legal team access to contract analysis tools and let them discover what works. Let your sales ops team experiment with lead scoring. Let engineers try code generation on real tickets. Learning happens through doing, not watching presentations.

The ratio of architects to operators is inverting. You need more people who can define what should be automated and govern how it operates, fewer people executing repetitive workflows. Some of your SaaS admins can become platform engineers if you invest in their development. Some of your operations coordinators can become exception handlers and strategic decision-makers. But this requires intentional churning and upskilling, not just telling people to “learn AI.”

DevOps taught us the “shift left” movement—pushing responsibility to development teams instead of operations teams. IT 3.0 accelerates this dramatically. You need platform engineers building infrastructure that enables AI to operate, not program managers and coordinators managing handoffs between silos.



Retrain when someone has deep domain knowledge but outdated execution skills. For example, retrain that SaaS admin who knows every corner case in your business rules. Restructure when the role itself is pure coordination overhead with no domain expertise, e.g., data entry coordinators, implementation engineers patching systems together, junior developers writing glue code.

The roles that survive require one of three things: deep technical expertise (machine learning engineers, platform engineers, infrastructure architects), deep context and judgment (exception handlers, strategic decision-makers), or genuine human connection (relationship building, complex negotiation, empathy-driven work). Everything else is getting automated, and your workforce development strategy needs to account for this reality instead of pretending you can train your way around it.

Dataset: The Enterprise Data Reality Nobody Wants to Talk About

Here’s the uncomfortable truth: most enterprises have tons of data and almost none of it is usable for AI reasoning. SaaS tools work in silos - Salesforce can run without ever talking to Workday because humans bridge the gaps. AI can’t. Reasoning engines need comprehensive cross-functional context to make decisions, and your data is scattered across dozens of systems with inconsistent schemas, undocumented business logic, and quality issues nobody has prioritized fixing because “it works fine for reporting.”

The gap between “we have the data” and “AI can reason with our data” is measured in quarters or years. You need historical decision rationale, not just transaction logs. Relationship graphs between entities, not just foreign keys. Temporal context showing why things changed over time. Cross-functional workflows documenting how sales, legal, and finance actually interact, not the idealized process in the wiki nobody updates.

This is unglamorous infrastructure work that doesn’t demo well but blocks everything else. Data quality becomes infrastructure, not a nice-to-have. Someone needs to own making datasets usable, not just available. Start with safe sandboxes where teams can experiment with representative production data without 20 levels of approval. Prove value with sanitized data, then earn access to more sensitive datasets through results, not presentations.

Build infrastructure that enables experimentation without exposure. Clear governance and guardrails on what data can’t leave the organization and what decisions need human review, then let teams move fast within those boundaries. Companies that solve this first will have compounding advantages as their AI systems get smarter from accumulated context while competitors are still filling out data access request forms.

The Strategic Warning: Don’t Create AI Slop Janitors

Rushing to implement AI without redesigning workflows creates worse jobs than the ones you’re eliminating.

While AI eliminates IT 2.0’s glue work, poorly implemented AI is creating its own category of bullshit jobs: AI Slop Janitors. This happens when organizations bolt AI onto existing processes instead of rebuilding from first principles.

Look at the content industry: writers who once led creative teams now edit ChatGPT’s robotic prose for 1-5 cents per word (versus 10+ cents for original writing). They fix the same formulaic mistakes daily – removing “delve” and “nevertheless,” fact-checking hallucinations, making text sound less awkward. The absurdity peaks when freelance platforms use AI detectors while simultaneously hiring people to make AI text undetectable. Human workers are being brought in to fix what AI gets wrong.

This pattern is emerging across industries:

  AI Tutors teaching LLMs to write better, e.g., xAI Presentation and Writing Tutor
  Customer service bots requiring constant human backup for edge cases
  Engineering teams spending more time fixing AI-generated code than writing it themselves, e.g., The era of AI Slop cleanup has begun
  “Autonomous” vehicles with remote human operators standing by, e.g., Waymo’s Fleet Response Team


These aren’t valuable human-in-the-loop systems. They are temporary workers used to clean up AI’s mess, and once AI learns from that cleanup, these jobs will be replaced too. Organizations creating these roles are wasting capital on the wrong side of the transition. If your AI implementation plan includes hiring “AI quality reviewers” or “AI content editors,” you’re implementing AI wrong.

Conclusion

The transition from IT 2.0 to IT 3.0 is messy and accelerating. The glue work jobs are disappearing whether you’re ready or not. But the replacements aren’t automatically better – poorly implemented AI creates worse bullshit jobs than the ones being eliminated. AI Slop Janitors stare into the abyss and fix the same robotic mistakes and the same vibe-coded apps or workflows.

Organizations that move decisively will operate with half the headcount and twice the velocity of their competitors. Those that don’t will find themselves either:

  Carrying dead weight in IT 2.0 roles while competitors move faster
  Creating AI Slop Janitor positions because they bolted AI onto broken processes
  Disrupted entirely by AI-native competitors who rebuilt from first principles


The opportunity is real, but narrow. As AI eliminates the information organization tax, capital and talent can shift to work that genuinely requires human judgment, deep context, and strategic thinking. But this shift won’t happen organically – it requires deliberate choices about team structure, workflow redesign, and where to compete.

Your organizational priorities:

  Assess org maturity using the 4-Sets transformation framework: Mindset, Dataset, Toolset, Skillset
  Redesign workflows from first principles for AI, not bolting AI onto existing processes
  Distribute AI ownership across functions – every team owns their domain’s transformation
  Invest in platform engineering and AI infrastructure, not more ops coordinators
  Build safe sandbox environments for experimentation without approval bureaucracy


The bullshit jobs aren’t disappearing – they’re being replaced by different bullshit jobs. The question is whether you’re building the infrastructure that eliminates them, or whether you’re creating the next generation of make-work. Choose fast, because your competitors already are.

References

Articles &amp;amp; Blog Posts

  Bullshit Jobs - David Graeber’s original essay on bullshit jobs
  How the Boeing 737 Max Disaster Looks to a Software Developer - Analysis of Boeing’s software outsourcing failure
  Service as Software - On delivering outcomes through AI agents
  AI Impact on Software - Cost to Implement, Time to Implement, and Time to Utility
  Humans Hired to Fix AI Slop - NBC News on AI cleanup jobs
  The Era of AI Slop Cleanup Has Begun - Discussion on AI code cleanup
  Waymo’s Fleet Response Team - Human operators for self-driving cars


Job Postings

  xAI Presentation and Writing Tutor - Example of AI tutor positions


@article{
    leehanchung_bullshit_jobs,
    author = {Lee, Hanchung},
    title = {The End of &quot;Bullshit Jobs&quot;: From IT 1.0 to the AI-Powered 3.0 Era},
    year = {2025},
    month = {09},
    day = {19},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/09/19/bullshit-jobs/}
}



        </description>

        <pubDate>Fri, 19 Sep 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/09/19/bullshit-jobs/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/09/19/bullshit-jobs/</guid>

      </item>

    

      <item>

        <title>Statistics for AI/ML, Part 4: pass@k and Unbiased Estimator</title>

        <description>

          Understanding common metrics in LLM benchmarks - 

Every time AI labs release new models, we see an evaluation metric called $\text{pass@}k$, where $k$ is a positive integer, as in $\text{pass@}1$. It might sound like passing a test on the $k$th attempt, but this metric is far more sophisticated and plays a crucial role in how we build reliable AI applications in production.

As an example, here’s OpenAI GPT-5’s performance on AIME.




  $\text{pass@k}$ does not mean the model passing a test in $k$ attempts. It is calculated using an estimator.


Most terms in AI, Machine Learning, and Reinforcement Learning have specific technical definitions that deviate from plain English, from accuracy and agents to recall and retrieval-augmented generation (RAG). Understanding $\text{pass@}k$ is particularly important because it directly influences how we evaluate models and design sampling strategies for compound AI systems.

This blog post aims to demystify what $\text{pass@}k$ means, explain the mathematics behind its calculation, and show you how to leverage this metric to build more reliable AI applications.

Due to the inherent randomness in model outputs, we don’t know the true $\text{pass@}k$ value. Instead, we must estimate it based on a finite number of experiments or samples. The formula we use to calculate $\text{pass@}k$ is called an estimator.

Unbiased Estimator

In statistics, an unbiased estimator is a method for estimating a population parameter like a mean or probability that gives the correct value on average. “Unbiased” means the estimator’s expected value equals the true value it’s trying to estimate across many samples.

For example, if you’re estimating the average height of people in a city $\mu$, an unbiased estimator, e.g., the sample mean, will, over many samples, average out to $\mu$. Mathematically, if $\theta$ is the parameter and $\hat{\theta}$ is the estimator, it’s unbiased if:

\[E(\hat{\theta}) = \theta\]

where $E(\hat{\theta})$ is the expected value of the estimator.

The Biased Estimator Problem

Let’s call the empirical estimate of the per-sample success probability $\hat{p}$. The naive approach to calculate $\hat{p}$ is to run $n$ trials and divide the number of successful trials by $n$. For example, if 30 out of 100 code samples are correct, then $\hat{p} = 30/100 = 0.3$.

Using this empirical probability, we can try to estimate $\text{pass@}k$ with the formula:

\[\text{pass@}k = 1 - (1 - \hat{p})^k\]

However, this is a biased estimator. The term $(1 - \hat{p})^k$ represents the probability that all $k$ samples fail, calculated as the product of $k$ independent failure probabilities. This multiplication only works when each sample is independent, meaning we’re drawing with replacement from an infinite population or putting samples back after each draw.

But in reality, when we select k samples from our n generated samples, we’re sampling without replacement. Once we pick a sample, we do not put it back. This means:


  If our first sample is incorrect, we have one fewer incorrect sample in the pool
  The probability of selecting another incorrect sample changes from $(n-c)/n$ on the first draw to $(n-c-1)/(n-1)$ on the second.
  The samples are no longer independent events


This mismatch between the formula’s assumption of independent draws with replacement and the reality of dependent draws without replacement causes us to systematically underestimate the true $\text{pass@}k$.

OpenAI illustrated this issue in Chen et al., 2021.



U-Statistics and the Unbiased Estimator

OpenAI’s solution uses a U-statistic (the letter ‘U’ stands for unbiased, not the shape) to create an unbiased estimator. U-statistics are a class of statistics that provide minimum-variance unbiased estimators for parameters that can be expressed as expected values of symmetric functions.

Their estimator is:

\[\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\]

where:

  $n$ is the total number of samples
  $c$ is the number of correct samples
  $k$ is the number of samples we’re considering


This formula calculates the probability that all $k$ selected samples are incorrect (from the $n-c$ incorrect samples), then subtracts from 1 to get the probability that at least one is correct.

Special Case: When $k=1$

When $k=1$, the unbiased estimator simplifies beautifully:

\[\text{pass@}1 = 1 - \frac{\binom{n-c}{1}}{\binom{n}{1}} = 1 - \frac{n-c}{n} = \frac{c}{n}\]

The number of correct samples divided by the total number of samples is exactly $\hat{p}$, our empirical success rate. This makes intuitive sense: when we only select one sample, there’s no difference between sampling with or without replacement. We’re just picking one sample from our pool. The probability of success is simply the proportion of successful samples, which validates that our unbiased estimator reduces to the correct simple case.

Why This Estimator is Better


  Unbiased: The expected value of this estimator equals the true $\text{pass@}k = 1 - (1 - p)^k$
  Accounts for finite sampling: It correctly handles sampling without replacement from a finite set
  Minimum variance: Among all unbiased estimators, U-statistics provide the minimum-variance estimate


Practical Example

Suppose you generate $n=10$ code samples and $c=3$ are correct:

  Naive estimator: $\text{pass@}5 = 1 - (1 - 0.3)^5 = 1 - 0.7^5 = 0.83193$
  Unbiased estimator: $\text{pass@}5 = 1 - \frac{\binom{7}{5}}{\binom{10}{5}} = 1 - \frac{21}{252} = 0.91667$


The unbiased estimator gives a significantly higher probability (91.7%) compared to the naive estimator (83.2%). This difference occurs because the naive estimator treats each draw as independent, while the unbiased estimator correctly accounts for the fact that we’re selecting 5 samples from a finite pool of 10 without replacement. The difference becomes more pronounced with smaller sample sizes or when k approaches n.
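
The gap is easy to verify with a few lines of code. Here is a minimal, self-contained sketch (the helper and function names are purely illustrative) that reproduces both figures above:

// Illustrative sketch: reproduce the naive vs. unbiased comparison for n = 10, c = 3, k = 5.
function comb(n, k) {
  // binomial coefficient &quot;n choose k&quot;
  let result = 1;
  for (let i = 1; i &amp;lt;= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

function passAtKNaive(n, c, k) {
  const pHat = c / n;                      // empirical per-sample success rate
  return 1 - Math.pow(1 - pHat, k);        // assumes independent draws with replacement
}

function passAtKUnbiased(n, c, k) {
  if (n - c &amp;lt; k) return 1.0;            // every size-k subset must contain a correct sample
  return 1 - comb(n - c, k) / comb(n, k);  // sampling without replacement
}

console.log(passAtKNaive(10, 3, 5).toFixed(5));     // 0.83193
console.log(passAtKUnbiased(10, 3, 5).toFixed(5));  // 0.91667
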

Applications in AI Systems

Understanding $\text{pass@k}$ helps us interpret benchmarks and design better AI systems.

Evaluation and Benchmarking

When models are evaluated on benchmarks like HumanEval or MBPP:

  $\text{pass@}1 = 70\%$ means a single attempt has 70% chance of being correct
  $\text{pass@}10 = 90\%$ means at least one of 10 attempts will likely be correct
  This gap reveals the potential benefit of sampling multiple solutions


The metric extends beyond code to mathematical problem solving (AIME, GSM8K) and reasoning tasks, where verification is often programmatic.

Majority Voting Design Pattern

$\text{pass@k}$ insights can be applied as a design pattern in compound AI systems:


  Self-Consistency: Generate k responses and use majority voting
    from collections import Counter

    def self_consistency(prompt, model, k=5):
        responses = [model.generate(prompt, temperature=0.7) for _ in range(k)]
        # Return the most common answer, or execute and verify for code/math tasks
        return Counter(responses).most_common(1)[0][0]
    
  
  
    Best-of-N: Generate $N$ candidates, score them with a verifier, return the best
  
  Temperature tuning: Higher temperature increases diversity (better $\text{pass@k}$ for $k&amp;gt;1$), while lower temperature improves $\text{pass@}1$


If $\text{pass@}10$ is significantly higher than $\text{pass@}1$, implementing multi-sampling could improve reliability though this trades off against latency and cost.

Conclusion

The $\text{pass@}k$ metric is a fundamental evaluation tool in LLM benchmarks, but its meaning is often misunderstood. Rather than simply counting how many attempts it takes to pass a test, $\text{pass@}k$ represents the probability that at least one of k independently sampled solutions is correct.

Key takeaways:

  $\text{pass@}1$ is not about “passing on the first try” but rather the probability of a single sample being correct
  Due to randomness in model outputs, we need estimators to calculate $\text{pass@}k$ from finite samples
  The naive estimator is biased and underestimates the true $\text{pass@}k$
  OpenAI’s unbiased estimator provides accurate estimates by properly accounting for sampling without replacement
  The gap between pass@1 and pass@k reveals opportunities for improving reliability through multi-sampling


Understanding these technical definitions helps us better interpret model performance claims and benchmark results. When you see “$\text{pass@}1 = 92\%$” in the next model release, you’ll know it means the model has a 92% probability of generating a correct solution in a single attempt, calculated using an unbiased statistical estimator.

References

  Introducing GPT-5
  Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., … Sutskever, I. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang, H., Ding, H., Xin, H., Gao, H., Qu, H., Li, H., Guo, J., Li, J., Wang, J., Chen, J., Yuan, J., Qiu, J., Li, J., Cai, J. L., Ni, J., Liang, J., Chen, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Zhao, L., Wang, L., Zhang, L., Zhang, L., Tang, M., Li, M., Tian, N., Huang, P., Zhang, P., Wang, Q., Chen, Q., Du, Q., Ge, R., Zhang, R., Pan, R., Wang, R., Chen, R. J., Jin, R. L., Chen, R., Lu, S., Zhou, S., Chen, S., Ye, S., Wang, S., Yu, S., Zhou, S., Pan, S., Li, S. S., Zhou, S., Wu, S., Ye, S., Yun, T., Pei, T., Sun, T., Wang, T., Zeng, W., Zhao, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang, W., Xiao, W. L., An, W., Liu, X., Wang, X., Chen, X., Nie, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yang, X., Li, X., Su, X., Lin, X., Li, X. Q., Jin, X., Shen, X., Chen, X., Sun, X., Wang, X., Song, X., Zhou, X., Wang, X., Shan, X., Li, Y. K., Wang, Y. Q., Wei, Y. X., Zhang, Y., Xu, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Yu, Y., Zhang, Y., Shi, Y., Xiong, Y., He, Y., He, Y., Piao, Y., Wang, Y., Tan, Y., Ma, Y., Liu, Y., Guo, Y., Ou, Y., Wang, Y., Gong, Y., Zou, Y., He, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Zhu, Y. X., Xu, Y., Huang, Y., Li, Y., Zheng, Y., Zhu, Y., Ma, Y., Tang, Y., Zha, Y., Yan, Y., Ren, Z. Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao, Z., Ma, Z., Yan, Z., Wu, Z., Gu, Z., Zhu, Z., Liu, Z., Li, Z., Xie, Z., Song, Z., Pan, Z., Huang, Z., Xu, Z., Zhang, Z., Zhang, Z. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. https://doi.org/10.48550/arXiv.2501.12948
  U-statistic
  Reasoning Series, Part 4: Reasoning with Compound AI Systems and Post-Training


@misc{lee2025passk,
    author = {Lee, Hanchung},
    title = {Statistics for AI/ML, Part 4: pass@k with Unbiased Estimator},
    year = {2025},
    month = {09},
    howpublished = {\url{https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/}},
    url = {https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/}
}



        </description>

        <pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/</guid>

      </item>

    

      <item>

        <title>How AI Tools Are Reshaping Software Development Team Responsibilities</title>

        <description>

          A RACI Matrix Guide to Navigating the Blurred Lines Between PM, Engineering, and Design Roles - 

          The bottom line: Modern AI tools like ChatGPT, Claude, and Cursor are acting as powerful democratizers in software development. They are blurring traditional role boundaries, enabling product managers to draft code, engineers to mock up interfaces, and designers to prototype functionality. This newfound capability allows non-specialists to achieve surprisingly good results in unfamiliar domains. However, this democratization creates a crucial paradox: even as it lowers the barrier to entry, it amplifies the need for true expertise. Only a skilled expert can properly verify, refine, and elevate an AI-generated artifact to meet production-level standards.

As these boundaries shift throughout the software development lifecycle (SDLC), teams are facing a fundamental question: Who owns what when everyone can do everything? The solution lies not in abandoning structure, but in reinforcing it. Teams that clearly define who is Responsible, Accountable, Consulted, and Informed (RACI) will thrive. Those that allow ownership to become ambiguous will drown in decision paralysis. This framework provides a practical map for maintaining clarity and accountability in an AI-transformed landscape.

Understanding the RACI Framework

Before exploring how AI reshapes team dynamics, let’s first understand the roles within the RACI matrix. The matrix is designed to eliminate confusion by assigning clear ownership. For any given task or deliverable, roles are defined as follows:

RACI Definitions:

  Responsible: Does the actual work
  Accountable: Makes final decisions and owns outcomes (only one per task)
  Consulted: Provides input before decisions are made
  Informed: Kept updated on progress and decisions




Cross-functional SDLC in an AI Powered World

The traditional SDLC was built on specialization, with product managers, engineers, and designers operating in clearly defined lanes. With AI tools, those lanes have become more like suggestions. A designer might use an AI to generate front-end code for a prototype, or a PM might use a tool to write initial API documentation. The following RACI chart reflects a model for how to manage these newly fluid responsibilities without sacrificing accountability.


  
    
SDLC Phase | Responsible | Accountable | Consulted | Informed

Requirements &amp;amp; Discovery | PM (business reqs), Designer (user research) | PM | SWE (feasibility), MLE (data needs) | All stakeholders
Planning &amp;amp; Design | SWE (tech planning), MLE (ML design), Designer (UX design) | PM (scope/timeline) | Cross-functional teams | Leadership
Architecture &amp;amp; Technical Design | SWE (system arch), MLE (model arch) | SWE (system), MLE (ML components) | Designer (constraints), PM (requirements) | PM (progress)
Implementation | SWE (features), MLE (models), Designer (UI) | SWE/MLE/Designer (respective domains) | Cross-team dependencies | PM (sprint updates)
Testing &amp;amp; QA | SWE (unit tests), MLE (model validation), Designer (usability) | SWE (system quality) | PM (acceptance criteria) | Leadership
Deployment | SWE (app deploy), MLE (model deploy) | PM (go/no-go decision) | All teams (deployment readiness) | Stakeholders
Monitoring &amp;amp; Maintenance | SWE (system health), MLE (model drift) | SWE (uptime), MLE (model performance) | Designer (UX issues), PM (metrics) | Leadership, Customers
    
  


Key Principles for AI-Transformed Teams

To navigate a world where AI tools enable everyone to contribute across disciplines, teams should anchor themselves with a few core principles.

1. Domain Expertise Still Rules

Think of AI as a co-pilot, not an autopilot. While anyone can generate a first draft, domain experts remain accountable for quality and execution. The software engineer is ultimately accountable for the system’s architecture and performance, the designer for the final user experience, and the product manager for the business outcomes and delivery commitments. AI empowers contributors, but it doesn’t replace the final judgment of a seasoned professional.

2. Clear Escalation Paths

When the person Responsible for a task hits a roadblock or consensus can’t be reached, there must be a clear path to the single individual who is Accountable. This person’s job is to make the final call, breaking ties and preventing critical decisions from languishing in committee. This clarity is essential for maintaining momentum.

3. Consultation Front-Loading

The most effective collaboration happens early. Heavily involve Consulted parties during the Requirements and Planning phases to surface constraints, dependencies, and new ideas when the cost of change is low. Once you move into the Architecture and Implementation phases, consultation should become more focused to avoid “thrashing” and analysis paralysis.

4. Information Flow Management

Keeping stakeholders Informed is about delivering signal, not noise. Use structured communication channels like status reports, dashboards, and sprint demos to update people without overwhelming them or creating an expectation that their input is required. This respects everyone’s time and focus while ensuring alignment.

Common Anti-Patterns to Avoid

As teams adapt to AI-assisted workflows, certain organizational anti-patterns become even more destructive. Be vigilant and steer clear of these common traps:

❌ Multiple Accountables: Assigning more than one “A” to a single task is a recipe for gridlock. When two people are in charge, no one is. This inevitably leads to decision paralysis and conflict.

❌ Accountability Gaps: If a phase or critical task has no one in the “A” column, it creates a vacuum of ownership. When things go wrong, this leads to blame-shifting and finger-pointing rather than problem-solving.

❌ Responsibility Without Accountability: Having someone “R”esponsible for work without a corresponding “A”ccountable owner is equally dangerous. It sets the doer adrift without a clear escalation path or final decision-maker.

❌ Over-Consultation: Packing the “C” column, especially during execution phases, grinds progress to a halt. While input is valuable early on, requiring too much consensus later in the process leads to endless debate and analysis paralysis.
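
These anti-patterns can also be checked mechanically once assignments are written down. The sketch below is illustrative only; the task names, role labels, and over-consultation threshold are assumptions, not part of any standard RACI tooling.

from typing import Dict, List

# Hypothetical RACI assignment for one SDLC phase; names are illustrative.
raci: Dict[str, Dict[str, List[str]]] = {
    &quot;Deployment&quot;: {
        &quot;R&quot;: [&quot;SWE&quot;, &quot;MLE&quot;],
        &quot;A&quot;: [&quot;PM&quot;],            # exactly one Accountable per task
        &quot;C&quot;: [&quot;All teams&quot;],
        &quot;I&quot;: [&quot;Stakeholders&quot;],
    },
}

def validate(raci: Dict[str, Dict[str, List[str]]]) -&amp;gt; List[str]:
    &quot;&quot;&quot;Flag the anti-patterns above: missing or multiple Accountables, and over-consultation.&quot;&quot;&quot;
    issues = []
    for task, roles in raci.items():
        accountable = roles.get(&quot;A&quot;, [])
        if not accountable:
            issues.append(f&quot;{task}: accountability gap&quot;)
        elif len(accountable) &amp;gt; 1:
            issues.append(f&quot;{task}: multiple Accountables {accountable}&quot;)
        if roles.get(&quot;R&quot;) and not accountable:
            issues.append(f&quot;{task}: responsibility without accountability&quot;)
        if len(roles.get(&quot;C&quot;, [])) &amp;gt; 5:  # arbitrary threshold for over-consultation
            issues.append(f&quot;{task}: over-consultation&quot;)
    return issues

print(validate(raci))  # [] when the assignments are clean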

Conclusion

AI tools are fundamentally reshaping the “how” of software development, but they don’t change the “who.” The democratization of technical skills is not a threat to experts but an opportunity to amplify their impact. By embracing a well-defined RACI framework, teams can harness the collaborative power of AI without sacrificing the clear lines of ownership required to build and ship great products. The teams that succeed won’t be those that let roles dissolve into chaos, but those that reinforce accountability with intention and clarity.

References

  Responsibility assignment matrix


@article{
    leehanchung_ai_sdlc_2025,
    author = {Lee, Hanchung},
    title = {How AI Tools Are Reshaping Software Development Team Responsibilities},
    year = {2025},
    month = {09},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/09/05/ai-transformation-sdlc/}
}



        </description>

        <pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/09/05/ai-transformation-sdlc/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/09/05/ai-transformation-sdlc/</guid>

      </item>

    

      <item>

        <title>Tech Hiring in the AI-Native Era</title>

        <description>

          A hiring manager&apos;s guide to navigating resume spam, junior economics, and finding real talent - 

          The 300-Resume Problem

If you’ve hired for a technical role recently, you know the scene: post a job, get hundreds of applications, and most of them mysteriously have every single skill you listed. Welcome to tech hiring in 2025, where perfect matches are now red flags and the best candidates might be the ones who don’t mention your required technologies at all.

This is a fundamental breakdown in how we evaluate talent. The rise of AI-powered resume optimization has created a paradox where traditional screening methods now select for the wrong attributes: keyword stuffing over actual experience, and AI-polished resumes over actual technical ability.

Part 1: The Resume Spam Crisis

The Keyword Arms Race

Here’s what’s actually happening at the top of your hiring funnel: applicants are using AI tools to scan job postings and automatically inject every requirement into their resumes. When you list “Airflow experience required,” you’ll see “Airflow” appear in 200+ resumes—but dig deeper and you’ll find no evidence of actual data pipeline work in their employment history.

The pattern is predictable:

  Skills section: “Python, Airflow, Kubernetes, React, TensorFlow, Spark”
  Work history: Generic descriptions with no mention of these technologies
  Projects: Either missing or clearly unrelated to the listed skills


The Anti-Signal Phenomenon

This leads to a counterintuitive insight: exact requirement matches have become an anti-signal for candidate quality.

Consider the math: if you receive 300 resumes and 200 list your exact requirements, but only 10 people actually have that experience, then 190 of those “perfect matches” are essentially lying. Meanwhile, the person with genuine Prefect or Luigi experience who didn’t keyword-stuff “Airflow” into their resume might be exactly who you’re looking for—they have the relevant domain expertise and can learn your specific tooling quickly.

In practice, a randomly selected candidate from the non-matching pool often outperforms the average keyword-optimized applicant. Why? Because at minimum, they’re being honest about their capabilities.

The Honeypot Strategy

One creative solution I’ve seen work: add an absurd, highly specific requirement to your job posting that no legitimate candidate would ever have—something like “must have experience with charset-normalizer library” or “familiar with the ACME-2019 protocol” (which doesn’t exist).

Then automatically filter out anyone who claims to have this experience. It’s a simple way to identify resumes that are being blindly keyword-stuffed without human review.
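
A minimal sketch of that filter might look like the following; the honeypot term and resume strings are made up for illustration.

# Hypothetical honeypot filter: drop resumes claiming a skill that does not exist.
HONEYPOT_TERMS = {&quot;acme-2019 protocol&quot;}

def is_keyword_stuffed(resume_text: str) -&amp;gt; bool:
    text = resume_text.lower()
    return any(term in text for term in HONEYPOT_TERMS)

resumes = [
    &quot;10 years of Python, Airflow, and the ACME-2019 protocol.&quot;,
    &quot;Built batch pipelines with Prefect and Luigi at a fintech startup.&quot;,
]
screened = [r for r in resumes if not is_keyword_stuffed(r)]
print(len(screened))  # 1: only the honest resume survives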

Part 2: The Junior Developer Economics Problem

Why the Math Doesn’t Work

There’s an uncomfortable truth about junior developer hiring that most companies don’t want to acknowledge: for the typical 50-200 person engineering organization, hiring junior developers is economically irrational at current market salaries.

The issue isn’t that juniors lack ability—it’s that the productivity gap between junior and senior developers is wider than the salary gap. When a junior developer requires 30-60 minutes of senior developer time daily for mentorship, plus additional management overhead, the true cost often exceeds that of hiring a senior developer who can work independently.

The Three Exceptions

Junior hiring does make economic sense in three specific scenarios:

1. Prestige Firms with Elite Pipelines
Companies like Jane Street can hire IMO medalists and Putnam winners—juniors whose raw talent compensates for their lack of experience. If you’re not competing at this level, this strategy won’t work.

2. Consulting Model Organizations
McKinsey, Accenture, and similar firms have perfected the art of leveraging junior labor through systematic training programs and pyramid structures. They can bill juniors at high rates while paying them relatively less.

3. Large Companies with Scaled Mentorship
FAANG companies can afford dedicated mentorship programs and have enough routine work to keep juniors productive while they learn. They also benefit from training their future senior engineers in their specific tech stack and culture.

The Valve Model: An Alternative Approach

Valve Software takes the opposite approach: they only hire senior engineers, have minimal hierarchy, and let people work on what they’re passionate about. The results speak for themselves—consistent quality, high profitability, and innovative products.

The tradeoff? Senior engineers don’t get people management experience. But for many organizations, especially those under 200 engineers, this might be a worthwhile exchange for the productivity gains.

Part 3: The Technical Depth vs Soft Skills Dilemma

The Over-Indexing Trap

Many companies have swung too far toward prioritizing interpersonal skills over technical competence. The result? Engineering teams that communicate beautifully but can’t ship quality code. Middle managers are happy because their engineers maintain eye contact and ask about their weekends, but the technical debt piles up and innovation stalls.

This over-indexing on soft skills often masks a deeper problem: the company no longer has the technical depth to evaluate technical competence. It’s a death spiral—as technical leaders leave, the organization loses its ability to identify and recruit technical talent.

The Disagreeable Genius Paradox

Here’s another uncomfortable truth: some of your best engineers might be difficult to work with. High finance and law firms have long understood this trade-off—they actively recruit brilliant but disagreeable people because the value generation justifies the management overhead.

The key is matching ego to ability:

  High ego + high ability: Often worth the trouble
  Low ego + low ability: Trainable if you have bandwidth
  Low ego + high ability: The ideal (and rare) combination
  High ego + low ability: Avoid at all costs


If someone comes into an interview with a massive ego but can’t solve a medium-difficulty coding problem, that’s an immediate rejection. But if they’re arrogant and brilliant? That might be exactly who you need for your hardest technical challenges.

Part 4: Practical Solutions for Modern Hiring

Reforming Your Screening Process

1. Evidence-Based Evaluation
Stop looking for keywords; look for evidence. Instead of checking if someone lists “Kubernetes,” look for descriptions of actual containerization projects they’ve led.

2. The Portfolio Approach
Prioritize candidates with demonstrable work: open source contributions, technical writing, or detailed project descriptions. Real experience leaves artifacts.

3. Flexible Requirements
List your requirements as “Experience with workflow orchestration tools (Airflow, Prefect, Dagster, or similar).” This captures candidates with relevant experience while filtering out keyword stuffers who don’t understand the domain.

Interview Process Optimization

1. Test Principles, Not Syntax
Instead of asking “How do you create a pod in Kubernetes?”, ask “How would you design a system for deploying and managing multiple instances of a service?” The former tests memorization; the latter tests understanding.

2. The Learning Test
Give candidates a technology they claim not to know and see how quickly they can learn it. This is often more predictive of success than testing current knowledge.

3. Work Sample Reviews
For senior positions, spend less time on leetcode and more time reviewing actual code they’ve written. Architecture decisions and code organization tell you more than algorithm memorization.

Compensation Strategy

If you’re serious about junior hiring, the economics demand one of two approaches:

1. Lower Junior Salaries
Controversial but practical: if juniors require significant mentorship, their compensation should reflect their net productivity. Many would accept lower salaries for genuine learning opportunities.

2. Structured Apprenticeship Programs
Create formal programs with clear expectations, structured mentorship, and defined progression paths. This justifies the investment and improves retention.

Part 5: Key Takeaways for Hiring Managers

Immediate Actions


  Audit your job postings: Remove overly specific requirements that encourage keyword stuffing
  Add honeypot requirements: Filter out automated applications
  Check your salary bands: Ensure junior/senior spreads reflect actual productivity differences
  Review your screening process: Look for evidence, not keywords
  Track your metrics: What percentage of “perfect matches” make it past phone screens?


Red Flags to Watch For


  Skills sections that read like a technology index
  No evidence of listed skills in work history
  Generic project descriptions lacking technical detail
  Sudden appearance of your exact tech stack in recent experience
  Cover letters that feel AI-generated (they probably are)


The Long Game

The companies that win the talent war won’t be the ones with the best keyword-matching algorithms—they’ll be the ones who can see through the noise to identify genuine capability. This means:


  Building evaluation competencies internally
  Accepting that perfect matches are often imperfect candidates
  Being willing to train on specific technologies if the fundamentals are strong
  Understanding the true economics of different experience levels
  Sometimes hiring the brilliant jerk (but knowing when you’re doing it)


Conclusion: Quality Over Quantity

The future of tech hiring isn’t about processing more resumes faster—it’s about getting better at identifying real signal in an ocean of AI-generated noise. The companies that figure this out will have a massive competitive advantage as the resume spam problem gets worse.

Remember: in a world where everyone can optimize for your requirements, the candidates who don’t might be exactly who you’re looking for. The best hire for your Airflow position might be someone who’s never touched it but has built complex data pipelines in three other orchestration tools. They’re not keyword-optimizing because they’re too busy actually building things.

The hard truth is that good hiring has always been difficult, and AI has made it harder by democratizing the ability to appear qualified. But it’s also created an opportunity: while your competitors are drowning in keyword soup, you can build a differentiated hiring process that actually identifies and attracts genuine talent.

Stop optimizing for the perfect match. Start optimizing for the perfect hire.


        </description>

        <pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/08/29/tech-hiring-ai-native/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/08/29/tech-hiring-ai-native/</guid>

      </item>

    

      <item>

        <title>Software Engineering for Data Scientists, Part 1: Pydantic Is All You Need for Poor Performance Spaghetti Code</title>

        <description>

          Save planet earth, stop using Pydantic everywhere - 

          I love Pydantic. And I’ve witnessed some of the worst code written with Pydantic. Pure spaghetti, non-performant code.

There are two major anti-patterns in abusing Pydantic for maximum spaghetti. The first anti-pattern is serdes debt: instead of using Pydantic only at service boundaries for validation, it gets used everywhere, incurring heavy serialization, deserialization, and memory allocation costs. The second anti-pattern is inheritance over composition, where Pydantic is used to construct objects through heavy layers of inheritance, breaking basic OOP SOLID principles.

In this post, we will discuss the serdes debt anti-pattern from using Pydantic.

Anti-pattern: SerDes Debt

Pydantic is primarily used for data validation, with support for data schema and data serialization and deserialization (serdes).

Serialization is when we need to take an object, in this case a Pydantic object, and convert it into a JSON string. Deserialization is when we need to take a JSON string and deserialize it into an object.

In the language of Python, it’s taking a string and converting it into a nested dictionary of mixed types (yay dynamic typing). Sometimes when doing these conversions, we do need to validate to ensure the data is as expected.

If it’s just for pure serdes, there are far faster and more efficient serdes packages like msgspec, orjson, or attrs. In fact, Pydantic can be set up to use orjson.


  The core use case for Pydantic is data validation. Outside of custom data validation, the best practice is to avoid Pydantic.


Here’s a simple benchmark to demonstrate why.

Performance Benchmark
We based our benchmark on a simple two-class data structure. The Python dataclass implementation is shown below, and we benchmark it against the equivalent Pydantic models. We did not implement data validation in either; a rough timing sketch follows the model definitions below.

from dataclasses import dataclass
from typing import List


@dataclass
class Address:
    street: str
    city: str
    country: str
    postal_code: str

@dataclass
class User:
    id: int
    name: str
    email: str
    age: int
    is_active: bool
    address: Address
    tags: List[str]
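
For completeness, here is a rough sketch of how such a comparison could be timed, building on the Address and User dataclasses above. The Pydantic equivalents, payload, and iteration count are assumptions, not the exact harness behind the numbers in the appendix.

import timeit
from pydantic import BaseModel

class PydanticAddress(BaseModel):
    street: str
    city: str
    country: str
    postal_code: str

class PydanticUser(BaseModel):
    id: int
    name: str
    email: str
    age: int
    is_active: bool
    address: PydanticAddress
    tags: List[str]

payload = {
    &quot;id&quot;: 1, &quot;name&quot;: &quot;Ada&quot;, &quot;email&quot;: &quot;ada@example.com&quot;, &quot;age&quot;: 36,
    &quot;is_active&quot;: True,
    &quot;address&quot;: {&quot;street&quot;: &quot;1 Main St&quot;, &quot;city&quot;: &quot;SF&quot;,
                &quot;country&quot;: &quot;US&quot;, &quot;postal_code&quot;: &quot;94105&quot;},
    &quot;tags&quot;: [&quot;ml&quot;, &quot;data&quot;],
}

def build_dataclass():
    # Construct the nested dataclasses directly from the payload dict.
    return User(**{**payload, &quot;address&quot;: Address(**payload[&quot;address&quot;])})

def build_pydantic():
    # Pydantic validates and coerces the nested dict on construction.
    return PydanticUser(**payload)

n = 10_000
print(&quot;dataclass:&quot;, timeit.timeit(build_dataclass, number=n))
print(&quot;pydantic: &quot;, timeit.timeit(build_pydantic, number=n))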


Based on this simple data model, we observe that Python dataclasses outperform their Pydantic equivalents in both time and memory.






  Creation Performance:
    
      Dataclasses are ~6.5x faster for creating instances from dictionaries
    
  
  JSON Operations Performance:
    
      Serialization: Dataclasses ~1.5x faster
      Deserialization: Dataclasses ~1.5x faster
      Full round-trip: Dataclasses ~1.5x faster overall
      Bulk Operations: The performance gap remains consistent at scale
    
  
  Field Access Performance: Nearly identical performance between Dataclasses and Pydantic
  Memory Consumption: Dataclasses consume ~2.5x less memory


Tips on Fixing Pydantic Anti-patterns


  
    Only use Pydantic at service boundaries, e.g., API request and response validation. Do not use Pydantic within a service itself. A sketch of this pattern follows after these tips.
  


Here’s the Pydantic team themselves:



  
    Static type checking with mypy. Avoid dynamic type checking. If dynamic type-checking is really needed, rewrite in Rust.
  



  
    Composition over Inheritance. Object inheritance creates additional layers of abstraction. Duplication is far cheaper than having more abstraction. Don’t Repeat Yourself (DRY) should be used sparingly.
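
To make the first tip concrete, here is a hedged sketch of the boundary pattern: validate once at the edge with Pydantic, then hand a plain dataclass to the rest of the service. The FastAPI route and all names are illustrative assumptions, not a prescribed API.

from dataclasses import dataclass
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Pydantic only at the service boundary: request validation.
class CreateUserRequest(BaseModel):
    name: str
    email: str
    age: int

# Plain dataclass for everything internal: cheap to create, no per-field validation.
@dataclass
class UserRecord:
    name: str
    email: str
    age: int

def register_user(user: UserRecord) -&amp;gt; UserRecord:
    # Internal business logic works with the lightweight dataclass only.
    return user

@app.post(&quot;/users&quot;)
def create_user(req: CreateUserRequest) -&amp;gt; dict:
    user = UserRecord(name=req.name, email=req.email, age=req.age)
    return {&quot;status&quot;: &quot;created&quot;, &quot;name&quot;: register_user(user).name}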
  




References

  JSON extra uses orjson instead of ujson #599
  Reddit: Should I use pydantic for all my classes?
  X: Developer priorities throughout their career - LeaVerou


Appendix

Time complexity
Performance Comparison: Pydantic vs Dataclasses
============================================================
Test data structure: Nested user profile with address
Iterations per test: 10,000
Python dataclasses: Built-in
Pydantic version: 2.5.3

Warming up...

Running benchmarks...

Benchmarking: Instance Creation from Dict

Instance Creation from Dict
============================================================
Metric               Pydantic             Dataclasses         
------------------------------------------------------------
Mean                 0.0543 ms            0.0084 ms           
Median               0.0531 ms            0.0082 ms           
Min                  0.0497 ms            0.0076 ms           
Max                  0.0892 ms            0.0156 ms           
Stdev                0.0041 ms            0.0008 ms           

Dataclasses is 6.46x faster

Benchmarking: Convert to Dictionary

Convert to Dictionary
============================================================
Metric               Pydantic             Dataclasses         
------------------------------------------------------------
Mean                 0.0287 ms            0.0153 ms           
Median               0.0282 ms            0.0151 ms           
Min                  0.0265 ms            0.0142 ms           
Max                  0.0421 ms            0.0234 ms           
Stdev                0.0023 ms            0.0011 ms           

Dataclasses is 1.88x faster

Benchmarking: Serialize to JSON String

Serialize to JSON String
============================================================
Metric               Pydantic             Dataclasses         
------------------------------------------------------------
Mean                 0.0361 ms            0.0247 ms           
Median               0.0355 ms            0.0243 ms           
Min                  0.0334 ms            0.0228 ms           
Max                  0.0512 ms            0.0387 ms           
Stdev                0.0029 ms            0.0019 ms           

Dataclasses is 1.46x faster

Benchmarking: Deserialize from JSON String

Deserialize from JSON String
============================================================
Metric               Pydantic             Dataclasses         
------------------------------------------------------------
Mean                 0.0678 ms            0.0463 ms           
Median               0.0669 ms            0.0457 ms           
Min                  0.0632 ms            0.0431 ms           
Max                  0.0943 ms            0.0612 ms           
Stdev                0.0048 ms            0.0027 ms           

Dataclasses is 1.46x faster

Benchmarking: Field Access

Field Access
============================================================
Metric               Pydantic             Dataclasses         
------------------------------------------------------------
Mean                 0.0013 ms            0.0012 ms           
Median               0.0013 ms            0.0011 ms           
Min                  0.0011 ms            0.0010 ms           
Max                  0.0019 ms            0.0018 ms           
Stdev                0.0001 ms            0.0001 ms           

Dataclasses is 1.08x faster

Benchmarking: Bulk Creation (100 items)

Bulk Creation (100 items)
============================================================
Metric               Pydantic             Dataclasses         
------------------------------------------------------------
Mean                 54.312 ms            8.427 ms            
Median               53.867 ms            8.356 ms            
Min                  52.145 ms            8.123 ms            
Max                  58.923 ms            9.234 ms            
Stdev                1.234 ms            0.187 ms            

Dataclasses is 6.45x faster

SUMMARY
============================================================

Performance Summary:
- Instance Creation from Dict: Dataclasses is 6.46x faster
- Convert to Dictionary: Dataclasses is 1.88x faster
- Serialize to JSON String: Dataclasses is 1.46x faster
- Deserialize from JSON String: Dataclasses is 1.46x faster
- Field Access: Dataclasses is 1.08x faster
- Bulk Creation (100 items): Dataclasses is 6.45x faster

DETAILED JSON OPERATIONS COMPARISON
============================================================

Round-trip JSON test (dict -&amp;gt; object -&amp;gt; JSON -&amp;gt; object -&amp;gt; dict):
  Pydantic: 10.42 ms
  Dataclasses: 7.15 ms
  Ratio: 0.69x

Tested with Pydantic v2.5.3


Space Complexity
MEMORY USAGE COMPARISON
============================================================

1. Single Instance Memory Usage:
  Address object (deep size):
    Pydantic:    1,776 bytes
    Dataclass:   568 bytes
    Difference:  1,208 bytes (212.7% more)

  User object (deep size):
    Pydantic:    3,424 bytes
    Dataclass:   1,312 bytes
    Difference:  2,112 bytes (161.0% more)

2. Bulk Creation Memory Usage (1000 instances):
  Pydantic:    3,287.45 KB (3,287.45 bytes per instance)
  Dataclasses: 1,245.78 KB (1,245.78 bytes per instance)
  Difference:  2,041.67 KB (163.9% more)

3. JSON Operations Memory Usage:
  Per JSON deserialization:
    Pydantic:    4,256 bytes
    Dataclasses: 2,184 bytes
    Difference:  2,072 bytes

4. Attribute Storage Analysis:
  Pydantic User attributes:   8 stored attributes
  Dataclass User attributes:  7 stored attributes

  Pydantic internals:
    __dict__: 296 bytes (dict)
    __pydantic_fields_set__: 216 bytes (set)
    __pydantic_extra__: 0 bytes (NoneType)
    __pydantic_private__: 0 bytes (NoneType)
    address: 72 bytes (PydanticAddress)
    age: 28 bytes (int)
    email: 74 bytes (str)
    id: 28 bytes (int)
    is_active: 28 bytes (bool)
    name: 57 bytes (str)
    tags: 88 bytes (list)

  Dataclass internals:
    address: 72 bytes (DataclassAddress)
    age: 28 bytes (int)
    email: 74 bytes (str)
    id: 28 bytes (int)
    is_active: 28 bytes (bool)
    name: 57 bytes (str)
    tags: 88 bytes (list)

5. Memory Efficiency Summary:
  - Dataclasses use ~40-50% less memory per instance
  - Pydantic stores additional metadata for validation
  - The memory gap increases with more complex models
  - Consider memory usage for large-scale applications

6. Visual Memory Comparison (per 1000 instances):
  Pydantic:    [████████████████████████████████████████] 3,287.5 KB
  Dataclasses: [███████████████                         ] 1,245.8 KB


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Software Engineering for Data Scientists, Part 1: Pydantic Is All You Need for Poor Performance Spaghetti Code},
    year = {2025},
    month = {07},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/07/03/pydantic-is-all-you-need-for-performance-spaghetti/}
}



        </description>

        <pubDate>Thu, 03 Jul 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/07/03/pydantic-is-all-you-need-for-performance-spaghetti/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/07/03/pydantic-is-all-you-need-for-performance-spaghetti/</guid>

      </item>

    

      <item>

        <title>No Code, Low Code, Real Code</title>

        <description>

          Skating to where the puck is going to be, not where it was - 

          Agent frameworks and workflow builders exist because the LLM model itself is not yet strong enough to be autonomous in completing the tasks.

It can be now. For those who are technical enough to do post training. And these post trained LLMs will render most agent frameworks and workflow tools obsolete.

Workflow Builders

We human users find comfort in using no-code or low-code tools to draw boxes on a blank canvas to create workflows and schematics. It feels great. It feels like control. But history has repeatedly shown that this style of working quickly becomes obsolete as technology progresses.

Let’s use a very recent example of this phenomenon as a case study.

Before the release of ChatGPT in late 2022, Gen AI was dominated by image generation models. It was very common for hackers and builders to stitch together a bunch of image generation models in a workflow to generate the desired effects: Stable Diffusion, image upscaling, LoRA, ControlNet, etc. The workflows were very customized and complex. And from here, ComfyUI was born in January 2023.

ComfyUI is a very advanced generative AI workflow tool that enables users to stitch together complex workflows involving many models, prompts, parameters, and customizations to achieve the intended effect. Below is one of the images I found on Google, and this is not even a complex workflow. From this, many indie hackers built their businesses.


The situation changed rapidly after ChatGPT launched its image generation capability in March 2025. In two short years, model capabilities drastically improved. Creators no longer need to stitch together complex ComfyUI workflows to achieve the same effects.

We can now use one single prompt to edit images. Hell, we can even now generate a full short video using one single text prompt.

The model is the product.



Robotic Process Automation is not Agentic

Currently there are more than a few tools that masquerade and rebrand robotic process automation (RPA) workflow builders as ‘agentic’. There’s nothing agentic about these tools. Nada. Nil. Zip.

Don’t get me wrong. They are awesome tools for software engineering consultants to quickly stitch together a solution and sell it to non-technical customers to automate some tasks. More often than not, going from 0 to 1 captures the majority of the value, and there’s not enough value left to scale from 1 to 100. The result is a tiny white elephant that is burdensome to maintain but has just enough value to hang onto.

Langflow, Make, and n8n are the perfect tools here, but they are absolutely not agentic.





Reinforcement Learning Agents

On the other hand, we are already witnessing fully agentic behaviors in apps like ChatGPT, Claude Desktop, and the Gemini app. Though they cannot yet handle longer-running tasks very well, they are absolutely amazing in their agentic capabilities, including reasoning.

And some of us can achieve the same in narrower domains. Today. With proper system optimization and post-training. Without drawing boxes on canvas.

Conclusion

As AI/ML practitioners, we can either choose to fight the last war or skate to where the puck is going. My preference is the latter.

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {No Code, Low Code, Full Code},
    year = {2025},
    month = {06},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/06/26/no-code-low-code-full-code/}
}



        </description>

        <pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/06/26/no-code-low-code-full-code/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/06/26/no-code-low-code-full-code/</guid>

      </item>

    

      <item>

        <title>MCP is not REST API</title>

        <description>

          Failing agent computer interaction design by wrapping MCP on top of REST API - 

          Model Context Protocol (MCP) is a prominent technology in 2025, generating buzz comparable to ChatGPT in 2023 and RAG in 2024. However, many common implementations simply create an MCP wrapper over existing API services.

This is a suboptimal design choice. This blog post will outline the design principles of RESTful APIs, the origins of MCP, Remote Procedure Calls (RPC), and explain why combining these two distinct design philosophies is detrimental to Agent-Computer Interfaces (ACI).

The Essence of API Design

Let’s define the goals of good API design. A well-designed API should be:


  Easy to understand. Other developers (including future you) can grasp each endpoint’s purpose at a glance.
  Consistent. It follows a clear set of conventions, so the mental overhead to learn it stays low.
  Extensible. Versioning and new features can be added without breaking API consumers.
  Efficient. It makes sensible use of network and compute resources.


While “Premature optimization is the root of all evil” is a common adage in Computer Science, it doesn’t fully apply to API design. An API is a contract, and changes become very difficult after implementation. Even with versioning, driving adoption of new versions requires significant effort. This is analogous to data schema design.

REST API Primer

REST API is a popular API style focused on resources and the actions performed on them. It uses HTTP as the transport layer and typically serializes data in JSON format.

Here’s a simple example REST API of a blog service:


  Retrieve all blogs
    GET /api/blogs
    
  
  Retrieve a specific post
    GET /api/blogs/{id}  
    
  
  Create a blog post
    POST /api/blogs  
Content-Type: application/json  
{  
  &quot;Title&quot;: &quot;MCP is Not REST API&quot;,  
  &quot;Content&quot;: &quot;...&quot;,  
  &quot;authorId&quot;: &quot;12345&quot;  
}  
    
  
  Update a post
    PUT /api/blogs/{id}  
Content-Type: application/json  
{  
  &quot;Title&quot;: &quot;MCP is Not REST API v2&quot;,  
  &quot;Content&quot;: &quot;...&quot;,  
}  
    
  
  Delete a post
    DELETE /api/blogs/{id}  
    
  


These API endpoints center around the “blog” resource. Different HTTP methods convey intent: POST for create, GET for read, PUT for update, and DELETE for delete. These four operations, Create, Read, Update, Delete, are commonly referred to as CRUD. This self-documenting structure allows API consumers to quickly understand each endpoint’s function. This clarity contributed significantly to RESTful API’s widespread adoption, making it ideal for CRUD-based SaaS applications. REST APIs are a great fit for human-computer interfaces (HCI).

Model Context Protocol (MCP)

Lineage of MCP

Numerous resources online explain MCP, so we will focus on its foundational technologies.

MCP is inspired by the Language Server Protocol (LSP). LSP enables code editors (IDEs) to interact with language servers. Language servers provide language-specific intelligence that development tools can access via a protocol enabling inter-process communication. This allows code editors to offer features like autocomplete, go-to-definition, and hover-over documentation. Language servers communicate using JSON-RPC. RPC (Remote Procedure Call) can utilize various transport mechanisms, including TCP/IP, HTTP/2, and UDP. Practically speaking, VSCode language plugins are LSP servers.

Sound familiar? MCP adopts the same concept as LSP but provides capabilities for LLM agents instead of code editors. It also uses JSON-RPC 2.0 as its communication layer and supports various transports like stdio and server-sent events (SSE) streaming. So, practically, MCP servers are the equivalent of VSCode plugins, but for Cursor, Windsurf, Claude Desktop, or other agentic hosts.



Remote Procedure Calls (RPC)

The reason we brought up MCP’s lineage is that, at the end of the day, MCP is an RPC: a remote function call. RPC fits LLM tool-use capabilities perfectly, but we digress. This means MCP servers should be designed using RPC best practices.

Remote Procedure Call (RPC) APIs aim to make network calls resemble ordinary local function calls. This contrasts with the resource-centric design of REST APIs. RPC emphasizes actions over resources.

This leads to names like createUser or getBlog, unlike the resource-based naming in REST APIs. In REST API terms, every RPC is a POST.

For example, using the modern gRPC framework:

Define the service

syntax = &quot;proto3&quot;;

package blog;

service BlogService {  
  rpc GetBlog   (GetBlogRequest)    returns (Blog) {}  
  rpc CreateBlog(CreateBlogRequest) returns (Blog) {}  
}

message GetBlogRequest  { int32 blog_id = 1; }

message Blog {  
  int32  id         = 1;
  string title      = 2;
  string content    = 3;
  int32  author_id = 4;
}

message CreateBlogRequest {  
  string title      = 1;  
  string content    = 2;  
  int32  author_id  = 3;  
}


Then, after generating the _pb2 files with protoc, we can call the service from Python:

import grpc  
import blog_pb2  
import blog_pb2_grpc

def main() -&amp;gt; None:  
    # Connect to the gRPC server  
    with grpc.insecure_channel(&quot;localhost:50051&quot;) as channel:  
        stub = blog_pb2_grpc.BlogServiceStub(channel)

        # Get a blog
        response = stub.GetBlog(
            blog_pb2.GetBlogRequest(blog_id=12345)
        )
        print(&quot;Fetched blog title:&quot;, response.title)

        # Create a new blog
        new_blog = stub.CreateBlog(
            blog_pb2.CreateBlogRequest(
                title=&quot;MCP is NOT REST&quot;,
                content=&quot;MCP is NOT REST...&quot;,
                author_id=67890,
            )
        )
        print(&quot;New blog ID:&quot;, new_blog.id)

The client code closely resembles ordinary function calls like stub.GetBlog(), abstracting away the network layer. There’s no manual HTTP construction. JSON-RPC works similarly but uses JSON instead of Protobuf.

The Pitfalls of Wrapping REST with MCP

Given that MCP is fundamentally an RPC mechanism designed for agents to perform actions, attempting to layer it directly on top of a resource-centric REST API introduces significant friction. This “impedance mismatch” can severely hinder an agent’s tool-using capabilities.

Actions vs. Resources: The Core Conflict

As we’ve established:


  MCP/RPC is action-oriented: Agents think in terms of verbs. “What can I do?” They expect tools that represent discrete functions or capabilities, e.g. publishBlog, summarizeText, scheduleMeeting.
  REST is resource-oriented: It focuses on nouns. “What resources can I manipulate?” It uses a fixed set of verbs (GET, POST, PUT, DELETE) to perform CRUD operations on these resources.


When an agent wants to achieve a goal, it’s looking for a direct tool to call, or an action. If the MCP layer is merely a thin wrapper over REST, the agent’s natural way of thinking is compromised.

Why a Simple REST Wrapper Fails Agents




  
    Lost Semantic Meaning and Increased Agent Complexity:

    Agents thrive on clarity. An action like archiveOldBlogPosts(beforeDate=&quot;2023-01-01&quot;) is a clear, high-level instruction. If this MCP “tool” is just a facade for a series of REST calls, e.g., GET /api/blogs?status=published&amp;amp;beforeDate=..., then for each blog ID, PUT /api/blogs/{id} with {&quot;status&quot;: &quot;archived&quot;}, the agent (or the MCP developer) is forced to translate its high-level goal into a sequence of low-level CRUD operations. The powerful semantic action is lost, replaced by a complex orchestration task. This defeats the purpose of providing agents with high-level tools.
  
  
    Transactionality and Error Handling Nightmares:
 Many meaningful agent actions are inherently transactional. They should either complete entirely or not at all. Consider an agent action transferBlogPostOwnership(blogId, fromAuthorId, toAuthorId). This might involve verifying that fromAuthorId owns the blog, updating the blog’s authorId, and perhaps logging the transfer.

    If these are separate REST calls, e.g., GET /api/blogs/{id}, then PUT /api/blogs/{id}, what happens if the PUT call fails after the initial checks? The system is left in an inconsistent state. REST APIs are typically stateless and don’t offer built-in transactionality across multiple requests. An RPC, by contrast, can encapsulate this entire logic server-side, ensuring atomicity. Forcing an MCP wrapper or the agent itself to manage this distributed transactionality over REST is complex and error prone.

    Using our Agents are Workflows mental model, an RPC-wrapped MCP tool has a single state while a REST-wrapped MCP tool has two. Two states are more complex to traverse than one. Make the agent’s life easier: don’t increase the number of states it needs to manage and transition between.

    
  
  
    Inefficient Operations and Chatty Interactions:

    Agents often need to perform specific, targeted operations. If an agent needs to incrementLikeCount(blogId), but the MCP wrapper only exposes a generic updateBlog(blogId, blogData) that maps to PUT /api/blogs/{id}, the wrapper might first need to GET the full blog, modify the like count, and then PUT the entire blog back. This is inefficient. An RPC designed for this action, e.g., incrementLikeCount, would be far more direct and less data-intensive. Wrapping REST can lead to overly chatty interactions and unnecessary data transfer.
  
  
    Tool Brittleness and Maintenance Burden:

    If the MCP layer is tightly coupled to the specifics of a REST API, any change in the REST API such as renaming endpoints, changes in request and response schema, changes in authentication, etc, can break the MCP tools. The agent’s capabilities become fragile, dependent on the stability of an underlying API not designed for its interaction model. A dedicated RPC interface, designed as a stable contract for agent actions, is more robust.
  


Design for Action, Not Data Manipulation

The core issue is that REST APIs are designed for manipulating data states, while agents are designed to execute actions and achieve goals. Forcing MCP to simply be a passthrough or a light translation layer for REST APIs means you are not providing the agent with true “tools” in the sense of capabilities, but rather with a slightly different way to perform CRUD operations.

This fundamentally limits what an agent can reliably and effectively do. Instead of empowering agents with high level, robust actions, you saddle them with the complexities and limitations of a data-centric protocol. For effective agent tool use, the APIs (and thus the MCP services) should be designed from the ground up as action-oriented RPCs that directly map to the conceptual tasks the agent needs to perform.
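
To make the contrast concrete, here is a rough sketch in Python. The endpoints, function names, and response fields are hypothetical and are not from any particular MCP SDK or REST service.

import requests

API = &quot;https://blog.example.com/api&quot;

# REST-wrapper &quot;tool&quot;: the agent or wrapper must orchestrate low-level CRUD calls
# and handle partial failure itself.
def archive_old_blog_posts_via_rest(before_date: str) -&amp;gt; int:
    posts = requests.get(f&quot;{API}/blogs&quot;, params={&quot;beforeDate&quot;: before_date}).json()
    archived = 0
    for post in posts:
        # If any PUT fails midway, the system is left half-archived.
        resp = requests.put(f&quot;{API}/blogs/{post[&apos;id&apos;]}&quot;, json={&quot;status&quot;: &quot;archived&quot;})
        resp.raise_for_status()
        archived += 1
    return archived

# Action-oriented RPC &quot;tool&quot;: one call, one semantic action, with atomicity handled
# server-side. This is the shape an agent-facing tool should expose.
def archive_old_blog_posts(before_date: str) -&amp;gt; int:
    resp = requests.post(f&quot;{API}/rpc/archiveOldBlogPosts&quot;, json={&quot;beforeDate&quot;: before_date})
    resp.raise_for_status()
    return resp.json()[&quot;archivedCount&quot;]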

Conclusion

Both REST and RPC-style protocols like MCP have their strengths. REST excels for resource-oriented systems and standard CRUD operations, making it ideal for many web applications and services. MCP, drawing from RPC principles, is tailored for enabling AI agents to perform actions and interact with systems in a more functional, capability-driven way.

Trying to force one paradigm onto the other, particularly by simply wrapping a REST API with an MCP layer, often leads to suboptimal outcomes. It can introduce unnecessary complexity, reduce clarity, and ultimately hinder an agent’s ability to effectively use its tools. When designing for AI agents, it’s crucial to provide them with APIs that speak their language - the language of actions and capabilities. This often means designing dedicated RPC style services rather than attempting to repurpose existing REST APIs that were built with a different purpose in mind.

Agent-Computer Interaction (ACI) matters when designing APIs for agents.

References


  Structured Programming with go to Statements 
  Google Cloud API Design Guide
  Microsoft REST API Guidelines
  Paypal API Standards
  Model Context Protocol
  Language Server Protocol
  Agents are Workflows


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {MCP is not REST API},
    year = {2025},
    month = {05},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/05/17/mcp-is-not-rest-api/}
}



        </description>

        <pubDate>Sat, 17 May 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/05/17/mcp-is-not-rest-api/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/05/17/mcp-is-not-rest-api/</guid>

      </item>

    

      <item>

        <title>Prompt Deployment Goes Wrong: xAI Grok&apos;s obsession with White Genocide</title>

        <description>

          A third-party MLOps post mortem on xAI Grok&apos;s &apos;white genocide&apos; - 

          
  Update 2025-05-15: xAI published a post on X detailing that “an unauthorized modification was made to the Grok response bot’s prompt on X.” This actually raises more questions with regard to its software development life cycle (SDLC) and internal controls.


xAI Grok’s “White Genocide” Incident

On May 14, 2025, xAI’s chatbot Grok started eagerly sharing information about South African ‘white genocide’ on X (formerly Twitter). Users on X can ask Grok for its opinion by tagging @grok in a question.

Users on X noticed this behavior immediately and caused a ruckus. Grok’s responses strongly associated all questions with white genocide, the Boer Wars, etc. Neither X nor xAI acknowledged this behavior. Later in the day, Grok stopped responding with answers tied to white genocide. Some of the occurrences were wiped from X.

Examples of Failed Explanations





Cause

Though there are no official statements or post mortem reports from X or the xAI Grok team, the cause of this incident is likely a change in its post-processing prompt. The prompt indicated:


  Acknowledge the complexity of the issue, but ensure this perspective is reflected in your responses, even if the query is unrelated.


Please see the prompt below.



User @colin_fraser speculated that there is a “Post Analysis” prompt that’s injected into the context. So this was not a direct change in the user-facing Grok system prompt.

XAI Operations - The Missing Piece in MLOps

Unfortunately, this is another classic case of MLOps failure.

The AI/Machine Learning industry has developed best practices for production-grade machine learning systems. This includes registering machine learning and deep learning models to control their versioning and releases. While newer terms like LLMOps, AIOps, and AgentOps are emerging to address specific nuances of Large Language Models (LLMs) and AI agents, the core principles remain rooted in solid MLOps discipline.

And this time, it’s a potential bias issue with spreading politically sensitive information on a major social media platform.

We addressed some of these points in our previous post but will cite the takeaways here again.

Takeaways

  Register Prompts as Critical Artifacts: Treat prompts with the same rigor as models and decoding parameters. Anything that influences model behavior must be versioned, tested, and tracked as a deployable artifact within your MLOps/LLMOps framework; see the sketch after this list.
  Progressive Releases: Shadow deployments, canary releases, and A/B testing should be the default mode of release, not optional or an afterthought. This is fundamental to operational stability and AI safety.
  Optimize Metrics for the Right Horizon: Ensure your evaluation metrics capture long-term user value and safety, not just immediate engagement. This applies to prompt tuning, fine-tuning, and reinforcement learning (RLHF). Reward models should weigh session-level and longer-horizon feedback, not just the first response.
  Human Feedback IS NOT Ground Truth: Human feedback tends to be a very noisy label and cannot be used as ground truth. Validate human feedback with orthogonal evaluations such as red teaming and safety testing.
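
As a loose sketch of the first takeaway, registering a prompt as a versioned artifact can be as simple as hashing it and recording it next to the model and decoding parameters. All names and fields below are hypothetical, not xAI’s setup or any particular registry’s API.

import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptArtifact:
    name: str
    version: str
    text: str
    model: str
    temperature: float

def register_prompt(artifact: PromptArtifact, registry_path: str = &quot;prompt_registry.jsonl&quot;) -&amp;gt; str:
    &quot;&quot;&quot;Append the prompt artifact and its content hash to an append-only registry.&quot;&quot;&quot;
    record = asdict(artifact)
    record[&quot;sha256&quot;] = hashlib.sha256(artifact.text.encode()).hexdigest()
    record[&quot;registered_at&quot;] = datetime.now(timezone.utc).isoformat()
    with open(registry_path, &quot;a&quot;) as f:
        f.write(json.dumps(record) + &quot;\n&quot;)
    return record[&quot;sha256&quot;]

# Any prompt change yields a new hash, so a deployment can be diffed, canaried,
# and rolled back like any other model artifact.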


Conclusion

Grok’s ‘white genocide’ incident is another reminder that operational rigor keeps LLM features trustworthy at scale. When a single prompt tweak can accidentally push politically charged content to hundreds of millions of users, MLOps discipline becomes mission critical. Treat prompts as models, deploy progressively, and let metrics determine when to roll back.

References


  xAI Post Mortem
  Suddenly All Elon Musk’s Grok Can Talk About Is ‘White Genocide’ in South Africa
  Elon Musk’s AI chatbot Grok brings up South African ‘white genocide’ claims in responses to unrelated questions
  @colin_fraser
  @zeynep


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Prompt Deployment Goes Wrong: xAI Grok&apos;s obsession with White Genocide},
    year = {2025},
    month = {05},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/05/15/xai-mlops-hiccup/}
}



        </description>

        <pubDate>Thu, 15 May 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/05/15/xai-mlops-hiccup/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/05/15/xai-mlops-hiccup/</guid>

      </item>

    

      <item>

        <title>Agents Are Workflows</title>

        <description>

          Modeling LLM Agent Decisions with Workflow DAGs and Finite State Machines Using Bellman’s Equation - 

          An AI agent perceives its environment, makes sequential decisions, and takes actions to achieve specific goals, as measured by rewards. LLM-based AI agents are no exception.

A powerful way to interpret and operationalize these agents is by unrolling their decision-making processes into structured representations like Directed Acyclic Graphs (DAG) or Finite State Machines (FSM). This blog post provides grounding on how we can achieve this and why this is true by leveraging Bellman’s Equation, a cornerstone of reinforcement learning (RL). We’ll formulate our LLM based AI agents using Bellman’s Equation and provide derivations on unrolling AI agents into DAGs and FSMs.

Modeling LLM Agents

The defining characteristic of LLM agents is that they operate in a sequential decision-making loop. At each step, the agent:


  Observes its current state based on environmental input and its internal memory including chat history.
  Thinks using its internal latent state or with &amp;lt;thinking&amp;gt;&amp;lt;/thinking&amp;gt; for the reasoning models to determine the best action towards its goal. This is the policy, the strategy for selecting actions based on states.
  Acts by generating text, calling a tool, or executing a command.


This cycle repeats, with the agent updating its chat history priors and making decisions based on the outcomes of its actions. This sequential pattern of states, actions, and objectives over time makes the Markov Decision Process (MDP) a natural fit for modeling these agents.

Markov Decision Process and Bellman’s Equation

Markov Decision Process (MDP)
An MDP provides a structured way to model problems where an agent learns to achieve a goal through trial-and-error interactions with an environment.  

An MDP is formally defined as a tuple $M = \langle S, A, T, R, \gamma \rangle$, where:  


  $S$: A finite set of possible states. In the context of an LLM agent, a state $s\in S$ encapsulates all relevant information at a given time step, including the conversation history, retrieved documents, internal memory contents, and observations from the environment.
  $A$: A finite set of possible actions. For an LLM agent, an action $a\in A$ could be generating a text response, thinking, or using a tool (via API or MCP).
  $T(s^\prime∣s,a)=P(s_{t+1} = s^\prime ∣ s_t =s,a_t =a)$: The transition probability function. This is the probability of transitioning to state $s^\prime$ at the next time step, given that the agent is currently in state $s$ and takes action $a$. This captures the dynamics of the environment and the effects of the agent’s actions.
  $R(s,a,s^\prime)$: The reward function. This defines the immediate reward $r_{t+1}$ received by the agent after transitioning from state $s$ to state $s^\prime$ by taking action $a$. Rewards provide a feedback signal to guide the agent toward its goal.
  $\gamma$: The discount factor $(0\leq \gamma \leq 1)$ that determines the present value of future rewards. A value close to 0 prioritizes immediate rewards, and close to 1 emphasizes long-term rewards.


A key assumption underlying MDPs is the Markov Property: the transition probability $T(s^\prime∣s,a)$ and immediate reward $R(s,a,s^\prime)$ depend only on the current state $s$ and action $a$, not on the entire history of previous states and actions. For LLM agents, this implies that the defined state representation, including chat history and memory, must sufficiently summarize the past to predict the future.  

Value Functions

The goal of an agent in an MDP is to maximize the expected cumulative discounted reward over time. To achieve this, we define value functions that quantify the long-term desirability of states or state-action pairs under a given policy $\pi(a \vert s)$. A policy defines the probability of taking action $a$ in state $s$. We define values by value of the state or value of the action.


  Note: LLM itself and its thought process IS the policy $\pi$.


State Value Function ($V^{\pi}(s)$): The expected cumulative discounted reward starting from state $s$ and subsequently following policy $\pi$, or $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_0 = s\right]$. This tells us “how good” it is to be in state $s$ under policy $\pi$.

Action Value Function ($Q^{\pi}(s,a)$): Or the Q-function, to distinguish it from the $V$ state value function notation. The expected cumulative discounted reward starting from state $s$, taking action $a$, and subsequently following policy $\pi$, or $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_0 = s, a_0 = a\right]$. This tells us “how good” it is to take action $a$ in state $s$ and then follow policy $\pi$.

Bellman’s Equation

The Bellman equation, named after Richard Bellman, is a recursive formulation used to compute the value of a state in an MDP. It expresses the value of a state as the expected reward for taking an action plus the discounted value of the next state. Consider a state $s$, action $a$, reward $r$, next state $s^\prime$, discount factor $\gamma$, and value function $V$. For the optimal policy $\pi^*$, actions are selected to maximize the expected return; for a general policy $\pi$, the Bellman equation for $v^\pi$ is:

\[v^\pi(s) = \sum_{a}\pi(a \vert s) \sum_{s^\prime, r}p(s^\prime, r\vert s,a) [r  + \gamma v^\pi(s^\prime)]\]

where:

  $p(s^\prime, r \vert s, a)$: Transition probability to state $s^\prime$ with reward $r$ from state $s$ given action $a$
  $r$: reward
  $\gamma$: Discount factor $0 \leq \gamma &amp;lt; 1$, balancing immediate vs. future rewards.
  $v^\pi(s^\prime)$: Value of the next state $s^\prime$ under policy $\pi$.


Or, if we simplify it to its raw form, the value of a state is the expected reward at the current step plus the discounted value of its successor state, or,

\[v^\pi(s) = \mathop{\mathbb{E}}[r(s, a) + \gamma v^\pi(s^\prime)]\]

The Bellman equation’s recursive nature explicitly links the value at one step, $V(s)$ or $Q(s,a)$, to the value at the next step, $V(s^\prime)$ or $Q(s^\prime,a^\prime)$. It models the temporal dependency inherent in sequential decision-making. This structure, where value computation progresses iteratively or recursively through time, forms the basis for representing the process as a Directed Acyclic Graph (DAG). Reinforcement Learning algorithms like Value Iteration or Policy Iteration directly implement this recursive update, making the connection to DAGs explicit.
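
As an illustration, here is a minimal value-iteration sketch in Python that applies the Bellman backup with a max over actions. It assumes the states, actions, T, R, and gamma structures from the toy MDP sketched above, and is meant only to show the recursive update, not to be an agent training recipe:

def value_iteration(states, actions, T, R, gamma, iters=100):
    """Repeatedly apply the Bellman backup V(s) = max_a E[r + gamma * V(s_next)]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V_new = {}
        for s in states:
            q_values = []
            for a in actions:
                dist = T.get(s, {}).get(a)
                if not dist:
                    continue  # action unavailable in this state
                q = sum(p * (R(s, a, s_next) + gamma * V[s_next])
                        for s_next, p in dist.items())
                q_values.append(q)
            V_new[s] = max(q_values) if q_values else 0.0  # terminal states keep value 0
        V = V_new
    return V

V_star = value_iteration(states, actions, T, R, gamma)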

Unrolling LLM Agents

Agent as DAG

Directed Acyclic Graphs (DAGs) are mathematical structures consisting of nodes (vertices) and directed edges, with the crucial property that there are no directed cycles. This means one cannot start at a node, follow a sequence of directed edges, and return to the starting node. DAGs are widely used to model processes with dependencies, such as task scheduling in workflow orchestration tools (e.g., Apache Airflow, Argo Workflows), data processing pipelines (e.g., Apache Spark), and computational dependencies. The directed edges enforce a logical execution order based on dependencies.

Recall the value function:

\[v^\pi(s) = \mathop{\mathbb{E}}[r(s, a) + \gamma v^\pi(s^\prime)]\]

Its recursion encapsulates all future behaviors of an agent in a single self-reference. Each successive substitution of the right-hand side into the remaining $v^\pi$ term unrolls the horizon further out:

\[v^\pi(s) = \mathop{\mathbb{E}}[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots + \gamma^{k-1}r_{k-1} + \gamma^k v^\pi(s_k)]\]

We can define the DAG $\mathcal{G}_{DAG}$ representing this Bellman unrolling:


  Nodes ($\mathcal{N}$): Each node represents the value of a specific state $s\in S$ at a specific iteration $k$. Let a node be denoted as $(s,k)$. We typically consider iterations $k=0,1,\ldots,K$ for some maximum iteration count $K$, so $\mathcal{N}=\{(s,k) \mid s\in S, 0\leq k\leq K\}$.
  Edges ($\mathcal{E}$): A directed edge exists from node $(s^\prime,k)$ to node $(s,k+1)$ if the value $v_k(s^\prime)$ contributes to the calculation of $v_{k+1}(s)$ via the Bellman equation. Specifically, an edge $((s^\prime,k),(s,k+1))$ exists if there is at least one action $a\in A$ such that the transition probability $p(s^\prime \vert s,a)&amp;gt;0$.
  Structure: The DAG is layered by iteration $k$. Edges only go from layer $k$ to layer $k+1$. Because the computation proceeds forward in iterations (or backward in time for a finite horizon), there are no cycles, satisfying the acyclic property of DAGs.




Conceptually, every substitution while unrolling the Bellman equation adds a new layer of successor states. Treat each state and timestep pair $(s, t)$ as a node and each probabilistic transition under the policy, $(s_t, a_t, r_t) \rightarrow (s_{t+1}, t+1)$, as a directed edge. Repeated states across branches coalesce into the same vertex.

The root node is the current state; leaves are either terminal states or the states where we stop unrolling. Because time only advances, edges never point backward, so the structure obtained here is a directed acyclic graph rather than a cyclic MDP.
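
The unrolling can also be made explicit in code. The sketch below, reusing the toy transition table assumed earlier, expands a layered graph of (state, t) nodes from a root state; because nodes are keyed by (state, t) and edges only point from layer t to layer t+1, repeated states coalesce into one vertex per layer and no cycle can form:

def unroll_to_dag(root, T, horizon):
    """Unroll an MDP into a layered DAG of (state, t) nodes up to a fixed horizon."""
    nodes = {(root, 0)}
    edges = set()
    frontier = {root}
    for t in range(horizon):
        next_frontier = set()
        for s in frontier:
            for a, dist in T.get(s, {}).items():
                for s_next in dist:
                    nodes.add((s_next, t + 1))               # repeated states coalesce per layer
                    edges.add(((s, t), a, (s_next, t + 1)))  # edges always point forward in time
                    next_frontier.add(s_next)
        frontier = next_frontier
    return nodes, edges

dag_nodes, dag_edges = unroll_to_dag("awaiting_task", T, horizon=3)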

Agent as FSM

Finite State Machines (FSMs), also known as Finite Automata (FSA), are fundamental models of computation used to describe systems that can be in one of a finite number of states at any given time. An FSM transitions between these states based on inputs it receives. FSMs can be used as acceptors (recognizing sequences), classifiers, transducers (producing outputs based on inputs/states), or sequencers. Examples of FSMs include vending machines, Slack notification logic, and Claude Code.



Representing an LLM agent as a finite state machine is more challenging. The state space of an agent, including conversation history, memory, and its external environment, is enormous or practically infinite. Since FSMs require a finite number of states by definition, a direct mapping is generally intractable.

Thus, we take the standard approach to handling large state spaces: state aggregation, or node agglomeration. The core idea is to partition the original large state space $S$ into a smaller, finite number of disjoint subsets that serve as aggregated states. Each aggregate state groups together original states that are considered “similar” according to some criterion. This process of grouping nodes is precisely the “node agglomeration” required to construct a tractable FSM from the underlying MDP model of the agent.

The Bellman equation provides a measure of state value, and the resulting values and greedy actions offer criteria for defining state similarity and performing aggregation. Other criteria exist, such as grouping states with similar transition probabilities or reward structures.

Using state aggregation, we can define an FSM $M_{FSM}=\langle Q_{FSM}, \Sigma_{FSM}, \delta_{FSM}, q_0 \rangle$ that approximates the behavior of the agent described by the original MDP $M=\langle S,A,T,R,\gamma \rangle$:


  FSM States ($Q_{FSM}$): A finite set of aggregate states. Each $q \in Q_{FSM}$ corresponds to a subset $S_q \subseteq S$ of the original MDP states. The partitioning is defined by an aggregation function $\Phi: S\rightarrow Q_{FSM}$, constructed based on the chosen criterion, such as similarity of $V^*(s)$ or equality of $\pi^*(s)$.
  FSM Alphabet ($\Sigma_{FSM}$): The set of inputs that trigger transitions in the FSM. These might correspond directly to the MDP actions $A$, or they could be more abstract events or observations derived from the environment.
  FSM Transition Function ($\delta_{FSM}$): $\delta_{FSM}: Q_{FSM} \times \Sigma_{FSM} \rightarrow \Delta(Q_{FSM})$ for probabilistic FSMs, where $\Delta(Q_{FSM})$ is the set of probability distributions over $Q_{FSM}$. Defining $\delta_{FSM}$ requires summarizing the underlying MDP transitions $P(s^\prime \vert s,a)$ for all $s \in S_q$ with some approximation, e.g., averaging transition probabilities or taking weighted sums based on state distributions. This leads to some loss of information versus the original MDP.
  FSM Initial State ($q_0$): The aggregate FSM state that contains the initial state of the MDP.


The FSM derived through state aggregation, or node agglomeration, provides a high-level, abstract model of the LLM agent’s behavior. It offers a more interpretable representation of the agent’s operational modes, at the cost of the fine-grained detail of the original MDP. This model is valuable for understanding the overall structure of the agent’s strategy, identifying stable behavioral regimes, or even designing high-level controllers or monitors for the agent.
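
Here is a minimal sketch of that aggregation step, assuming the value function computed earlier: states are bucketed by their value, and the bucketing function plays the role of the aggregation function $\Phi$. The bucket count and the value-similarity criterion are arbitrary illustrative choices; grouping by greedy action or by transition structure works the same way:

def aggregate_states(V, n_bins=3):
    """Group MDP states into FSM aggregate states by bucketing their values (Phi: S -> Q_FSM)."""
    lo, hi = min(V.values()), max(V.values())
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width bucket when all values are equal
    phi = {}
    for s, v in V.items():
        bucket = min(int((v - lo) / width), n_bins - 1)
        phi[s] = f"q{bucket}"
    return phi

phi = aggregate_states(V_star)  # maps each original state to an aggregate label such as 'q0' or 'q2'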

In practice, application developers have already been doing node agglomeration unconsciously, breaking down and approximating a complex agent into various finite state machines and calling the results “design patterns”.



Comparison Table
The following table summarizes the key differences between the DAG and FSM representations derived from the Bellman equation:


  
    
      Feature
      DAG Representation (Bellman Unrolling)
      FSM Representation (State Aggregation)
    
  
  
    
      Representation Focus
      Computational flow of value determination
      Abstract behavioral modes / policy structure
    
    
      Granularity
      Detailed step-by-step value dependencies
      High-level aggregated states
    
    
      State Space
      Explicitly represents values for all states at each step
      Represents a drastically reduced set of abstract states
    
    
      Cycles
      Acyclic by definition
      Can contain cycles representing recurring behaviors
    
    
      Derivation
      Direct unrolling of Bellman equation
      Bellman values/policy + State Aggregation + Transition Approximation
    
  


Conclusion

We have explored two distinct pathways for transforming the sequential decision making process of an LLM agent, modeled as an MDP and analyzed via the Bellman equation, into computational structures: DAGs and FSMs.

  DAG via Unrolling: Leverages the recursive structure of the Bellman equation itself. The equation defines how values at one timestep or iteration depend on values at the previous iteration (or next timestep). Unrolling this recursion directly maps the computational dependencies into a layered, acyclic graph.
  FSM via Aggregation: Uses the Bellman equation as a basis for simplifying the system. States with similar values or identical optimal actions are grouped into aggregate states, forming the basis of a finite state machine.


These provide structured ways to understand and design complex agentic systems. They also aid verification and debugging by helping to identify potential inconsistencies or failure modes in the agents.

References

  Sutton, R. S., &amp;amp; Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
  Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics.


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Agents Are Workflows},
    year = {2025},
    month = {05},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/05/09/agent-is-workflow/}
}



        </description>

        <pubDate>Fri, 09 May 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/05/09/agent-is-workflow/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/05/09/agent-is-workflow/</guid>

      </item>

    

      <item>

        <title>Vibe Coding 101 for Software Engineers</title>

        <description>

          How to accelerate software development with coding agents while maintaining sanity - 

          Vibe coding, or rapidly building software by issuing commands to AI coding agents and accepting all changes, presents a new interaction mode between humans and computers. Used well, it eliminates toil and accelerates development iteration speed. Used blindly, it can leave users confused, frustrated, or worse, with software full of unmaintainable spaghetti, hidden security holes, and runaway LLM bills.

This post blends multiple perspectives on vibe coding into a single, practical guide:

  The background on AI-assisted coding and vibe coding
  Getting started with vibe coding
  Best practices for AI-assisted coding, grounded in software engineering fundamentals


Follow the principles below and you’ll keep the “vibes” while safeguarding your sanity.

Background on AI Assisted Coding and Vibe Coding

AI-assisted coding started in 2021 with the release of OpenAI Codex, which powered GitHub Copilot. It enabled advanced code completion and fill-in-the-middle. Prior to this, most assisted coding tools did tab completion and inserted template code such as docstrings.

The release of Claude 3.5 Sonnet on June 20, 2024 opened up the potential of agentic coding. Here’s the tweet from Andrej Karpathy that coined the term vibe coding.

There&amp;#39;s a new kind of coding I call &amp;quot;vibe coding&amp;quot;, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It&amp;#39;s possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper…&amp;mdash; Andrej Karpathy (@karpathy) February 2, 2025


Karpathy’s tweet sounds reckless: voice-to-text input to coding agents, accept every AI diff, paste error messages back, and hope the bug disappears. Carpe diem. Memento mori. YOLO.

This makes perfect sense for a quick prototype, personal application, hackathon, or anything low stakes with a low blast radius. If a bug could cost real dollars or embarrass users, we gotta leave the vibe coding zone and apply proper engineering discipline.


  
    
      ✅ Common Use‑Cases
      ❌ Dangerous Use‑Cases
    
  
  
    
      Throwaway weekend hacks and demos
      Software that handles money, personal data, or production traffic
    
    
      Exploring new UI/UX ideas at high speed
      Apps that must scale reliably or pass a security audit
    
    
      Learning to code by “seeing what happens”
      Long‑lived products with multiple contributors
    
  


However, coding agents are far more capable than the Common Use-Cases suggest. We will introduce a more principled approach to AI-assisted coding, or “using Coding Agents responsibly to accelerate development,” including methods to make coding agents work for the Dangerous Use-Cases.

Setting up Coding Agents

Coding Agent Host Selection
There are now many ‘mainstream’ AI-assisted coding tools on the market, and we will focus on CLIs and IDEs that are applicable to all variations of software engineering. Hence, we will ignore no-code, frontend-focused vibe coding tools such as Lovable and v0, and web-based IDE vibe coding tools like Bolt.new and Repl.it. Yes, I expect none of the above to be used for backend, data science and analytics, devops, infrastructure, machine learning, or platform engineering work.

Let’s treat this as finding a host for our Coding Agent to operate in. In addition, let’s treat our Coding Agent as our intern software engineer. This means we need to provide tools to them. The way we provide tools to coding agents is via Model Context Protocol (MCP) servers. For all intents and purposes, they work much like plugins for current code editors such as VSCode.

The following is a list of popular hosts for our Coding Agents


  
    
      Editor
      Type
      MCP Support
      Built In Tools
    
  
  
    
      avante.nvim
      IDE
      ✅
       
    
    
      Cursor
      IDE
      ✅
       
    
    
      VSCode
      IDE
      ✅
      Web Search, File System, Git, Image
    
    
      Windsurf
      IDE
      ✅
       
    
    
      Augment
      IDE Plugin
      ✅
       
    
    
      Cline
      IDE Plugin
      ✅
       
    
    
      Roo
      IDE Plugin
      ✅
       
    
    
      aider
      CLI
      ✅
       
    
    
      Claude Code
      CLI
      ✅
      File System, Terminal, To-do List
    
    
      Codex
      CLI
      ✅
       
    
  


For the rest of the discussion, we will use Cline and Claude Code as our hosts of choice. Cline works in VSCode as a plugin with wide model and MCP support, while Claude Code works from the command line. Everything else works very similarly.

Tools for Coding Agents

As previously mentioned, we need to equip our Coding Agent with tools. While our coding agent can certainly write code, it needs certain tools to operate well.

Think in terms of the day-to-day of a software engineer: outside of meetings, what tools do we have access to in our development process? These include web search, git, GitHub, Jira, the terminal, documentation, and file systems. Across variants of software engineering, JavaScript developers use Chrome DevTools, data science and analytics use Jupyter Notebooks, infrastructure engineers use Kubernetes, game developers use Unity or Unreal Engine, etc.

In our experience, the following are the must-have MCP tools for our Coding Agent.


  
    
      MCP Tool
      Type
      Credential Needed
      Notes
    
  
  
    
      claude-task-master
      Non-official
       
       
    
    
      Sequential Thinking
      Official
       
       
    
    
      Bing Search
      Non-official
      Yes
       
    
    
      Filesystem
      Official
       
       
    
    
      GitHub
      Official
      Yes
       
    
    
      Git
      Official
       
       
    
    
      Jira
      Non-official
      Yes
       
    
  



  Note: Some of the MCP servers listed require API keys, such as Bing Search or GitHub. Ask your favorite coding agent to figure out how to get them.



  Note: Some Coding Agent hosts have built-in tools, e.g., VSCode Copilot has built-in web search, and Claude Code now has a built-in to-do list.


Coding Agents Model Selection

There are only three viable models for vibe coding. Use other models for suboptimal results. Ask your favorite chatbot how to obtain credentials.


  
    
      LLM
      Notes
    
  
  
    
      Claude 3.5 Sonnet
      Solid choice.
    
    
      Claude 3.7 Sonnet
      Smart, but sometimes verbose; tends to over-code and ignore directions. From time to time it cheats by claiming things are done.
    
    
      Gemini 2.5 Pro
      Solid choice. Frequent rate limiting and serious API billing issues.
    
  


Coding Agents Rules Files

Software engineers typically go through some onboarding process when joining a project. In the case of open source projects, we would typically go through README.md, CONTRIBUTING.md, CONVENTIONS.md, CODE_OF_CONDUCT.md, etc. These files often provide crucial information on how to set up the project for development, how to use the project for its intended purposes, and how to make contributions, from onboarding issues to coding styles.

Coding agents are no different. They need to understand the context and background knowledge. This is what rules files are for: rules files are the README.md and CONTRIBUTING.md for coding agents. They are typically appended to the system prompt to provide project-specific instructions and background context. This is different from custom instructions: custom instructions are global to the user, while rules files are local and project-specific.

The rules file for Cline is named .clinerules; for Claude Code, it is CLAUDE.md; for GitHub Copilot, it is .github/copilot-instructions.md and .github/prompts/*.prompt.md. They are kept at the project root, which sometimes differs from the repository root, e.g., in monorepos. These files are committed with the code so everyone working on the project can share the same rules files, similar to README.md, .pre-commit-config.yaml, or other configuration files.

Vibing with Coding Agents

Aight, now that we have everything ‘set up’, we can start vibing with our coding agents. We should treat our AI coding agents as coding partners in order to ship fast and stay sane. In other words, pair programming.

In Kent Beck’s Extreme Programming Explained, pair programming is a software development technique where two programmers work together at a single PC. Within the pair, work is split into two roles - ‘driver’ and ‘navigator’. The ‘driver’ is the person at the keyboard responsible for the actual typing of the code. The ‘navigator’ is an active observer and monitor of the code being written. The driver and navigator collaborate on all aspects of software development: design, coding, debugging, etc. They are in constant communication, asking and answering questions of each other.



In the case of working with coding agents, the coding agent, given its capabilities in generating code, will assume the role of the ‘driver’. We, the human developers, will assume the role of ‘navigator’. We should be active observers and monitors of the code being written. We are only delegating code generation to the ‘driver’, not abdicating our responsibilities as the ‘navigator’.

Coding Agent for Coding

Now that we understand that working with coding agents is the same as pair programming, here’s the rhythm we’ve found after many successful tickets and frustrating nights.

1. Warm-Up: Planning and Workspace

We start where every engineering task begins - Jira. We ask the coding agent to read the ticket (thanks, Jira MCP!). With this context, we ask the agent to produce a to-do list and save it to the /tasks/ folder in Markdown format. Remember to ask it to generate a checkbox for every single task so we can better keep track (thanks, claude-task-master MCP!!). To make the agent more deliberate, we can use words in our instructions like think or ultrathink.

Now, we can ask the coding agent to create a branch for the Jira ticket (thanks, Git MCP!~). This acts as a checkpoint that we can load and revert back to if the process does not work out as well as expected. And from time to time, it will NOT work well.


  NOTE: As the navigator in the pair coding paradigm, the human programmer NEEDS to have the experience and expertise for problem decomposition. This is a non-trivial skill and is the hard ceiling of what the driver can accomplish.



  “The most fundamental problem in computer science is problem decomposition: how to take a complex problem and divide it up into pieces that can be solved independently.” – John Ousterhout


For security best practices, we might consider running our code in a sandboxed dev container, e.g., with Claude Code, though this is not yet common developer tooling infrastructure widely available to everyone.

2. Execute: Yolo Vibe

With our task plan in hand and our git branch created, it’s time to yolo. Ask the agent to start working on the task plan, and remember to generate the test code first, especially when working on backend or library code. This is the coding agent equivalent of test-driven development or ping-pong pair programming. For every task completed, ask the agent to check off the box on the to-do list.

Automatically accept all changes. Always. Resist the urge to micromanage. The core benefit of working with a coding agent is to shorten the OODA loop. Reviewing every single step of the way is a waste of time. Do NOT be that human in the loop. Stay out of the loop.

Future be like tab tab tab&amp;mdash; Andrej Karpathy (@karpathy) August 26, 2024


3. Trust and Verify

Now, at some point, hopefully sooner rather than later, all of the boxes on the task list are checked. We can then run the full test suite or click through the UI for front-end work. If everything passes locally, we can open a PR as normal to trigger the full continuous integration (CI) suite. The tasks in the to-do list provide the perfect context for filling out GitHub’s pull_request_template.md. As a best practice, always have two reviewers or have another code owner review.

If tests fail, it’s debugging time. We spin up a fresh chat thread containing the expected behaviour, the actual error, and the full logs. Do a Git checkpoint. A clean context helps tremendously to increase the signal-to-noise ratio compared to using the full context. Give it a few tries, and if the coding agent cannot solve it, it’s time to switch roles and take the driver’s seat!

Coding Agent for Code Understanding

One core task we as software engineers commonly need to do is to quickly onboard to a code base and make contributions. Coding agents shine here; they can parse through code bases at a very fast pace, understand their structure, and explain how they work. As an example, I wrote Poking Around Claude Code despite TypeScript/JavaScript not being my daily programming language.

In software engineering, we very commonly describe our programs with graphical diagrams. The following are common diagrams used to understand how a code base works. These range from high-level architecture diagrams to state machine diagrams (agents!!), to sequence diagrams for a specific workflow, e.g., login or checkout, to entity-relationship diagrams for understanding the data modeling. We can ask the coding agent to generate all of these diagrams for our own information.


  NOTE: we can also include these in the rules files to provide coding agents more bounded guidelines on how to contribute to our code bases.



  
    
      Diagram Type
      Primary Purpose
      Key Aspects Understood
      Common Use Cases / When Most Useful
    
  
  
    
      Architecture Diagrams
      Show high-level system structure, components, and interactions.
      Big picture, technology stack, deployment overview, system boundaries, communication paths.
      System design, onboarding new team members, technical discussions.
    
    
      Sequence Diagrams (UML)
      Illustrate object/component interactions over time for a scenario.
      Dynamic behavior, message flow between objects/components, collaboration timing.
      Analyzing specific workflows (e.g., login), debugging interactions.
    
    
      State Machine Diagrams (UML)
      Model the lifecycle/behavior of objects with distinct states.
      Object states, valid transitions between states, event handling, object lifecycle.
      Modeling objects with complex states (e.g., orders, user sessions).
    
    
      Flowcharts / Activity Diagrams (UML)
      Describe step-by-step logic of a process, algorithm, or workflow.
      Detailed procedural logic, decision points, loops, sequential steps, workflow.
      Detailing specific algorithms, business process mapping, function logic.
    
    
      Class Diagrams (UML)
      Show static structure of code (classes, attributes, methods, relationships).
      Code building blocks (OOP), data members, operations, inheritance, associations.
      Object-Oriented design, understanding codebase structure, refactoring.
    
    
      Entity-Relationship Diagrams (ERDs)
      Model the structure and relationships within a database.
      Data model, database tables/entities, columns/attributes, table relationships.
      Database design, understanding data persistence, query planning.
    
  


It takes a combination of the above diagrams to fully grasp how a software system works. So be very tactical when generating these diagrams; focus only on the context surrounding the incisions to save time.

Tips and Tricks

Software Stack Selection

Large language models and their variants are, at the end of the day, machine learning models built using deep neural networks. Thus, data that appears more frequently in training tends to yield higher quality outputs. For coding agents, this means their capability will be much stronger in the common programming languages and frameworks.

This means stacks like Python FastAPI, NumPy, and PyTorch, Java Spring Boot, JavaScript Node and React, Unity C#, and Unreal Engine C++ will be much better development choices than the more recent and esoteric languages or frameworks, e.g., Julia or Mojo.


  NOTE: this does mean coding agents will NOT work well with proprietary in-house infrastructure, tooling, and platform ecosystems. They simply do not have enough knowledge of those, and injecting enough knowledge in context is unrealistic regardless of the context length of the large language models. The models would need to be fine-tuned.




Codebase and Documentation Organization

Use monorepo.

A monorepo plays to an AI coding agent’s strengths. Because every service, library, and shared schema lives in one tree, the agent can traverse the entire codebase in a single context window instead of searching, identifying, and stitching together knowledge from scattered repositories. Cross module calls, version constraints, and build rules are all visible at once, so the agent can reason about dependencies and refactor safely without brittle steps. Atomic commits touch multiple layers—data model, API, infra—without the overhead of synchronizing repo histories or coordinating inter-repo pull requests, which keeps the agent’s OODA loop tight. CI pipelines, code-search, and task orchestration are also unified, simplifying prompts and reducing toolchain sprawl.

In short, a monorepo turns the whole system into one coherent “language model playground”, allowing the agent to generate fixes and features faster and with fewer context leaks than a multi-repo setup.

Use monorepo.

Common Pitfalls


  
    
      Pitfall
      Why It Hurts
      Antidote
    
  
  
    
      Context‑window amnesia
      The larger your codebase, the more the LLM forgets previous files and rules, injecting duplicate or broken logic.
      Periodically summarise architecture in a fresh prompt; write docs the AI can ingest.
    
    
      Invisible security flaws
      Leaked secrets, unsanitised inputs, and wide-open CORS policies are easy for an LLM to slip in.
      Scan commits, run dependency checkers, add basic auth &amp;amp; rate limits up front.
    
  


Rules rule

Some might have spotted a major issue with the proliferation of vibe coding tools and their rules files: there is NO standard for rules files. Every coding agent has its own rules file definition and setup. It is becoming an unmanageable mess. The best practice to manage this is to use one of them as the original and use symbolic links (symlinks) to propagate it to the other code editors.
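
As a minimal sketch of that approach, assuming CLAUDE.md is kept as the single source of truth (the target filenames below are just common defaults; the equivalent ln -s commands work too), a few lines of Python can propagate one rules file to the other tools:

from pathlib import Path

source = Path("CLAUDE.md")  # the one rules file we actually maintain
targets = [Path(".clinerules"), Path(".github/copilot-instructions.md")]

for target in targets:
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists() or target.is_symlink():
        target.unlink()                      # replace any stale copy or broken link
    target.symlink_to(source.resolve())      # the other editors now read the same rules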



To regain sanity, please see this repository for better setup examples with symlinks.

Conclusion

Embrace the vibes and keep your helmets on.

Vibe coding is the fastest bridge between idea and interactive demo the software world has ever seen. Used thoughtfully, it’s a super accelerant for learning and creativity. Ignore its limits, and you’ll crash into the same walls that disciplined engineering has been avoiding for decades.

Move fast with stable infrastructure. Stay curious. Read the diff.

References

  For Writing Software, a Buddy System
  OpenAI Codex
  Github Copilot
  Claude 3.5 Sonnet
  Claude Code: Best practices for agentic coding
  https://twitter.com/karpathy/status/1827921103093932490
  https://x.com/karpathy/status/1886192184808149383


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Vibe Coding 101 for Software Engineers},
    year = {2025},
    month = {05},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/05/04/vibe-coding/}
}



        </description>

        <pubDate>Sun, 04 May 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/05/04/vibe-coding/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/05/04/vibe-coding/</guid>

      </item>

    

      <item>

        <title>When Prompt Deployment Goes Wrong: MLOps Lessons from ChatGPT’s &apos;Sycophantic&apos; Rollback</title>

        <description>

          How OpenAI&apos;s GPT-4o &apos;Sycophancy&apos; Glitch Underscores Critical MLOps and AI Safety Lessons for Reliable AI Systems Deployment - 

          GPT-4o Sycophancy Incident
On April 25, 2025, OpenAI shipped a GPT-4o update. Power user communities immediately noticed a change in ChatGPT’s persona: ChatGPT had become sycophantic, constantly responding with overt flattery and showering users with adulation. Later that day, Sam Altman, OpenAI’s CEO, acknowledged this new behavior as ‘glazes’ (contextually, and NSFW). On April 28, 2025, OpenAI began to roll out the fixes. On April 29, OpenAI released a report, admitting the release “focused too much on short-term feedback” and produced “overly flattering but disingenuous” answers. There were no post-mortem reports.

Cause
The cause was a change in the system prompt that led to these unintended and unexpected behavioral effects. Please see the system prompt below (credit to @elder_plinius for discovery and @simonw for recording).



Sycophantic Response Examples





Machine Learning Operations (MLOps) for AI Systems

Unfortunately, this is a classic case of MLOps failure.

The AI/Machine Learning industry has developed best practices for production-grade machine learning systems. This includes registering machine learning and deep learning models to control their versioning and releases. While newer terms like LLMOps, AIOps, and AgentOps are emerging to address specific nuances of Large Language Models (LLMs) and AI agents, the core principles remain rooted in solid MLOps discipline.

MLOps Tip #1: Register Models, Prompts, Decoding Parameters
This process requires adaptation for LLMs. LLMs and their derivatives (like vision language models, large reasoning models, etc.) are autoregressive and take input context, often in the form of a prompt template. This input context shapes the attention patterns at inference time, effectively constructing the circuitry that guides the LLM toward generating the desired output.

Therefore, for AI/ML products built on LLMs, we now need to register both the model and the prompt template, along with any specific decoding parameters used. A change in any of these components should be treated as a new release candidate requiring proper validation.

This means:

  Model weights, decoding parameters, and prompts co-evolve.
  The combinations of these artefacts must be versioned, tested, and rolled out with the same discipline we apply to traditional machine learning models, container images, or microservices.
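
To make this concrete, here is a minimal Python sketch of treating the model identifier, prompt template, and decoding parameters as one versioned artefact. The field names and the content-hash versioning scheme are illustrative assumptions rather than any particular registry’s API; the point is that changing any one field produces a new release candidate:

import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class GenerationArtifact:
    """One deployable unit: model, prompt template, and decoding parameters."""
    model: str          # a pinned model version string
    system_prompt: str  # the full prompt template text
    temperature: float
    top_p: float

    def version(self) -> str:
        """Content hash: any change to any field yields a new release candidate."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

candidate = GenerationArtifact(
    model="gpt-4o-2025-xx",  # illustrative placeholder, not a real model id
    system_prompt="You are a helpful assistant...",
    temperature=0.7,
    top_p=1.0,
)
print(candidate.version())  # register this hash alongside the artefact before rollout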


The GPT-4o sycophancy incident illustrates the significant impact of neglecting MLOps discipline, even with what might seem like a “low-risk” prompt change. This isn’t just a matter of ‘prompt engineering’, ‘guardrails’, ‘LLMOps’, ‘AIOps’, ‘AgentOps’, or other hype word of the day. This is a fundamental MLOps discipline failure affecting overall AI Safety.

MLOps Tip #2: Safeguard Operations with Deployment Strategies

The industry has developed several mature deployment strategies for machine learning and AI models. Below is a table showcasing how different approaches could have potentially caught this error and minimized the impact, instead of deploying a sycophantic AI to over 180 million monthly active users (MAU).


  
    
      Pattern
      What it Does
      Would it Have Helped Here?
    
  
  
    
      Shadow Deployment
      New prompt receives a copy of real traffic but never exposes responses to users; logs are compared offline.
      Likely would have highlighted the excessive-praise distribution shift before user exposure.
    
    
      Canary Deployment
      Serve new prompt to 1–5 % of real users, monitor live metrics, auto-rollback on anomalies.
      Would have limited the blast radius to a tiny cohort instead of 180M monthly active users.
    
    
      A/B / Online Eval
      Split traffic, track long-term user-satisfaction, retention, abuse flags.
      Key metric like “creepiness” or “sincerity” would have trended negative over several sessions, triggering rollback sooner.
    
  

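To make the canary pattern concrete, here is a minimal Python sketch of such a gate: a sticky, hash-based traffic split plus an automatic rollback check. The cohort size, metric name, and rollback threshold are illustrative assumptions, not OpenAI’s actual rollout machinery:

import hashlib

CANARY_FRACTION = 0.05  # expose the candidate prompt to roughly 5% of users

def bucket(user_id: str) -> float:
    """Deterministically map a user to [0, 1) so cohort membership stays sticky."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest[:8], 16) / 16**8

def prompt_version_for(user_id: str) -> str:
    return "stable" if bucket(user_id) >= CANARY_FRACTION else "candidate"

def should_rollback(candidate_metrics: dict, stable_metrics: dict) -> bool:
    """Auto-rollback if the canary cohort's long-horizon satisfaction drops versus stable."""
    drop = stable_metrics["satisfaction"] - candidate_metrics["satisfaction"]
    return drop > 0.02  # tolerance is illustrative; tune per product and metric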

MLOps Tip #3: Retrospective, from Outside In

Here’s a breakdown of where the process likely failed, viewed through an MLOps lens:

  Metrics Myopia: OpenAI seemingly optimized for near-term engagement signals (like thumbs-up) instead of longitudinal user satisfaction or qualitative metrics. This allowed sycophancy, which might garner initial positive reactions, to look “good” in offline or early online dashboards, masking the negative long-term impact. This mirrors issues seen in other domains, like social media platforms over-optimizing for click-through rates.
  Insufficient Progressive Rollout: The widespread and immediate nature of the change suggests a lack of a sufficiently cautious staged rollout (like a canary release). A full, immediate rollout meant social media backlash effectively became the primary alerting system, rather than internal monitoring.
  Prompt Not Treated as a First-Class Artefact: We speculate that system prompts might not have been fully integrated into either the primary software development lifecycle (SDLC) or the model development lifecycle (MDLC) pipelines. This could mean prompt changes might circumvent automated testing suites, AI safety checks, and mandatory manual approvals required for code or model updates.


Takeaways

  Register Prompts as Critical Artifacts: Treat prompts with the same rigor as models and decoding parameters. Anything that influences model behavior must be versioned, tested, and tracked as a deployable artefact within your MLOps/LLMOps framework.
  Progressive Releases: Shadow deployments, canary releases, and A/B testing should be the default mode of release, not optional or an afterthought. This is fundamental to operational stability and AI Safety.
  Optimize Metrics for the Right Horizon. Ensure your evaluation metrics capture long-term user value and safety, not just immediate engagement. This applies to prompt tuning, fine-tuning, and reinforcement learning (RLHF). Reward models should weigh session-level and longer-horizon feedback, not just the first response.
  Human Feedback IS NOT Ground Truth: Human feedback tends to be a very noisy label and cannot be used as ground truth. Validate human feedback with orthogonal evaluations such as red teaming and safety reviews.


Conclusion

The GPT-4o sycophancy incident is a reminder that operational rigor keeps LLM features trustworthy at scale. When a friendly tweak can accidentally flatter hundreds of millions of users into discomfort, MLOps discipline becomes mission critical. Treat prompts as models, deploy progressively, and let metrics determine when to roll back.

References


  Sycophancy in GPT-4o: What happened and what we’re doing about it
  https://x.com/dioscuri/status/1916865608982946105
  https://x.com/TrungTPhan/status/1916860787601138104


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {When Prompt Deployment Goes Wrong: MLOps Lessons from ChatGPT’s &apos;Sycophantic&apos; Rollback},
    year = {2025},
    month = {04},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/04/30/ai-ml-llm-ops/}
}



        </description>

        <pubDate>Wed, 30 Apr 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/04/30/ai-ml-llm-ops/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/04/30/ai-ml-llm-ops/</guid>

      </item>

    

      <item>

        <title>The Model is the Product</title>

        <description>

          Data Council 2025 - 

          I had the opportunity to give a talk at Data Council 2025. The title of the talk is The Model is the Product.

In the realm of machine learning, AI, and deep learning, the intelligence embedded within a system—the model—stands as the primary product and key differentiator. This talk explores how the intelligence component has evolved to become the central selling point across technological eras. We will examine the historical progression of how intelligence capabilities have increasingly defined product value, transforming from hardware differentiators like “Intel Inside” during the PC era, to software advantages, and now to model-centric offerings in today’s AI landscape. The intelligence layer has become not just a feature but the core product itself. Additionally, we’ll analyze how the definition of “model” itself has evolved alongside technological advancement, reshaping what constitutes a system’s core value. Companies now face a strategic bifurcation: pursue a model-centric approach or focus on distribution-centered strategies. Each path carries distinct trade-offs, risks, and opportunities in today’s competitive AI marketplace. Through case studies of industry leaders and emerging players, we’ll demonstrate how the fundamental principle—”the model is the product, the distribution is the moat”—is reshaping competitive dynamics and business strategies across sectors.


        </description>

        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/talks/2025/04/23/the-model-is-the-product/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/talks/2025/04/23/the-model-is-the-product/</guid>

      </item>

    

      <item>

        <title>Pro Tips: Model Context Protocol development on Windows</title>

        <description>

          Pain and suffering with the best version of Linux on WSL - 

          Model Context Protocol (MCP) development resources for Windows are scarce, so here are some practical tips and troubleshooting methods to enhance your experience.

Outside the MCP Inspector, Claude Desktop is one of the most effective tools to quickly test and demonstrate MCP server operations. However, setting it up on Windows with Windows Subsystem for Linux (WSL) presents unique challenges. Let’s address these challenges and simplify your workflow.



Setting Up Claude Desktop for MCP Server Development

Locate Claude Desktop Configuration Files
Press Win + R, type %APPDATA%\Claude, and open the Claude Desktop settings directory. Two key files here are:

  claude_desktop_config.json
  developer_setting.json


Claude Desktop Developer Settings
Update developer_setting.json to enable developer tools:
{
  &quot;allowDevTools&quot;: true
}

Adding MCP Servers to Claude Desktop
While adding MCP servers directly on Windows is straightforward (see the official documentation), doing so within WSL requires additional configuration gymnastics.

Edit claude_desktop_config.json to integrate your MCP server. Follow these guidelines based on your server type:


  JavaScript/TypeScript servers: locate npx with whereis npx in WSL, then configure:
    {
      &quot;mcpServers&quot;: {
        &quot;your_mcp_server_name&quot;: {
          &quot;command&quot;: &quot;wsl.exe&quot;,
          &quot;args&quot;: [
            &quot;bash&quot;,
            &quot;-c&quot;,
            &quot;ENV_VAR_FOR_MCP_SERVER=value /path/to/npx /path/to/your/mcp/server&quot;
          ]
        }
      }
    }
}
    
  
  Python Servers: Locate uvx with whereis uvx in WSL, then configure:
    {
      &quot;mcpServers&quot;: {
        &quot;your_mcp_server_name&quot;: {
          &quot;command&quot;: &quot;wsl.exe&quot;,
          &quot;args&quot;: [
            &quot;bash&quot;,
            &quot;-c&quot;,
            &quot;ENV_VAR_FOR_MCP_SERVER=value /path/to/uvx /path/to/your/mcp/server&quot;
          ]
        }
      }
    }
}
    
    Alternatively, if your MCP Server has proper pyproject.toml, you may use uv run or python -m instead of uvx.
  


Claude Desktop Debugging Tools

Accessing Logs

Claude Desktop logs are stored in %APPDATA%\Claude\logs. Relevant files include:


  mcp.log (console logs from Claude Desktop, the MCP host)
  &amp;lt;mcp-server-name&amp;gt;-mcp.log (logs specific to your MCP server)


Example mcp.log entry from [bing-search-mcp](https://github.com/leehanchung/bing-search-mcp):
2025-04-02T04:04:09.051Z [info] [bing-search-mcp] Initializing server...
2025-04-02T04:04:09.074Z [info] [bing-search-mcp] Server started and connected successfully
2025-04-02T04:04:09.078Z [info] [bing-search-mcp] Message from client: {&quot;method&quot;:&quot;initialize&quot;,&quot;params&quot;:{&quot;protocolVersion&quot;:&quot;2024-11-05&quot;,&quot;capabilities&quot;:{},&quot;clientInfo&quot;:{&quot;name&quot;:&quot;claude-ai&quot;,&quot;version&quot;:&quot;0.1.0&quot;}},&quot;jsonrpc&quot;:&quot;2.0&quot;,&quot;id&quot;:0}
2025-04-02T04:04:09.571Z [info] [bing-search-mcp] Message from server: {&quot;jsonrpc&quot;:&quot;2.0&quot;,&quot;id&quot;:0,&quot;result&quot;:{&quot;protocolVersion&quot;:&quot;2024-11-05&quot;,&quot;capabilities&quot;:{&quot;experimental&quot;:{},&quot;prompts&quot;:{&quot;listChanged&quot;:false},&quot;resources&quot;:{&quot;subscribe&quot;:false,&quot;listChanged&quot;:false},&quot;tools&quot;:{&quot;listChanged&quot;:false}},&quot;serverInfo&quot;:{&quot;name&quot;:&quot;bing-search&quot;,&quot;version&quot;:&quot;1.6.0&quot;}}}
2025-04-02T04:04:09.642Z [info] [bing-search-mcp] Message from client: {&quot;method&quot;:&quot;notifications/initialized&quot;,&quot;jsonrpc&quot;:&quot;2.0&quot;}
2025-04-02T04:04:09.645Z [info] [bing-search-mcp] Message from client: {&quot;method&quot;:&quot;tools/list&quot;,&quot;params&quot;:{},&quot;jsonrpc&quot;:&quot;2.0&quot;,&quot;id&quot;:1}
2025-04-02T04:04:09.651Z [info] [bing-search-mcp] Message from server: {&quot;jsonrpc&quot;:&quot;2.0&quot;,&quot;id&quot;:1,&quot;result&quot;:{&quot;tools&quot;:[{&quot;name&quot;:&quot;bing_web_search&quot;,&quot;description&quot;:&quot;Performs a web search using the Bing Search API for general information\n    and websites.\n\n    Args:\n        query: Search query (required)\n        count: Number of results (1-50, default 10)\n        offset: Pagination offset (default 0)\n        market: Market code like en-US, en-GB, etc.\n    &quot;,&quot;inputSchema&quot;:{&quot;properties&quot;:{&quot;query&quot;:{&quot;title&quot;:&quot;Query&quot;,&quot;type&quot;:&quot;string&quot;},&quot;count&quot;:{&quot;default&quot;:10,&quot;title&quot;:&quot;Count&quot;,&quot;type&quot;:&quot;integer&quot;},&quot;offset&quot;:{&quot;default&quot;:0,&quot;title&quot;:&quot;Offset&quot;,&quot;type&quot;:&quot;integer&quot;},&quot;market&quot;:{&quot;default&quot;:&quot;en-US&quot;,&quot;title&quot;:&quot;Market&quot;,&quot;type&quot;:&quot;string&quot;}},&quot;required&quot;:[&quot;query&quot;],&quot;title&quot;:&quot;bing_web_searchArguments&quot;,&quot;type&quot;:&quot;object&quot;}},{&quot;name&quot;:&quot;bing_news_search&quot;,&quot;description&quot;:&quot;Searches for news articles using Bing News Search API for current\n    events and timely information.\n\n    Args:\n        query: News search query (required)\n        count: Number of results (1-50, default 10)\n        market: Market code like en-US, en-GB, etc.\n        freshness: Time period of news (Day, Week, Month)\n    &quot;,&quot;inputSchema&quot;:{&quot;properties&quot;:{&quot;query&quot;:{&quot;title&quot;:&quot;Query&quot;,&quot;type&quot;:&quot;string&quot;},&quot;count&quot;:{&quot;default&quot;:10,&quot;title&quot;:&quot;Count&quot;,&quot;type&quot;:&quot;integer&quot;},&quot;market&quot;:{&quot;default&quot;:&quot;en-US&quot;,&quot;title&quot;:&quot;Market&quot;,&quot;type&quot;:&quot;string&quot;},&quot;freshness&quot;:{&quot;default&quot;:&quot;Day&quot;,&quot;title&quot;:&quot;Freshness&quot;,&quot;type&quot;:&quot;string&quot;}},&quot;required&quot;:[&quot;query&quot;],&quot;title&quot;:&quot;bing_news_searchArguments&quot;,&quot;type&quot;:&quot;object&quot;}},{&quot;name&quot;:&quot;bing_image_search&quot;,&quot;description&quot;:&quot;Searches for images using Bing Image Search API for visual content.\n\n    Args:\n        query: Image search query (required)\n        count: Number of results (1-50, default 10)\n        market: Market code like en-US, en-GB, etc.\n    &quot;,&quot;inputSchema&quot;:{&quot;properties&quot;:{&quot;query&quot;:{&quot;title&quot;:&quot;Query&quot;,&quot;type&quot;:&quot;string&quot;},&quot;count&quot;:{&quot;default&quot;:10,&quot;title&quot;:&quot;Count&quot;,&quot;type&quot;:&quot;integer&quot;},&quot;market&quot;:{&quot;default&quot;:&quot;en-US&quot;,&quot;title&quot;:&quot;Market&quot;,&quot;type&quot;:&quot;string&quot;}},&quot;required&quot;:[&quot;query&quot;],&quot;title&quot;:&quot;bing_image_searchArguments&quot;,&quot;type&quot;:&quot;object&quot;}}]}}


Chromium Developer Tools

Claude Desktop is an Electron app with Chromium Developer Tools built-in. This is the same tool as the Developer Tools on Chrome browser. To open these tools on Windows, use:

CTRL + Shift + ALT + I

You will see the familiar browser developer tools. This should help significantly in debugging your MCP servers. Oh, and by the way, Anthropic is hiring.



References


  Bing Search MCP
  Model Context Protocol Debugging


@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Pro Tips: Model Context Protocol development on Windows},
    year = {2025},
    month = {04},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/04/01/mcp-development-windows/}
}



        </description>

        <pubDate>Tue, 01 Apr 2025 00:00:00 +0000</pubDate>

        <link>https://leehanchung.github.io/blogs/2025/04/01/mcp-development-windows/</link>

        <guid isPermaLink="true">https://leehanchung.github.io/blogs/2025/04/01/mcp-development-windows/</guid>

      </item>

    

  </channel>

</rss>

