GPT-4o Sycophancy Incident

On April 25, 2025, OpenAI shipped a GPT-4o update. Power user communities immediately noticed a change in ChatGPT’s persona. Power user communities immediately noticed a change in ChatGPT’s persona. ChatGPT had become sycophantic, constantly responding with overt flattery and showering users with adulation. Later that day, Sam Altman, OpenAI’s CEO, acknolwedged this new behavior as ‘glazes’ (contextually, and NSFW). On April 28, 2025, OpenAI began to roll out the fixes. On April 29, OpenAI released a report, admitting the release “focused too much on short-term feedback” and produced “overly flattering but disingenuous” answers. There were no post-mortem reports.

Cause

The cause was a change in the system prompt that led to these unintended and unexpected behavioral effects. Please see the system prompt below (credit to @elder_plinius for discovery and @simonw for recording).

Sycophantic Response Examples

alt text
alt text
alt text

Machine Learning Operations (MLOps) for AI Systems

Unfortunately, this is a classic case of MLOps failure.

The AI/Machine Learning industry has developed best practices for production-grade machine learning systems. This includes registering machine learning and deep learning models to control their versioning and releases. While newer terms like LLMOps, AIOps, and AgentOps are emerging to address specific nuances of Large Language Models (LLMs) and AI agents, the core principles remain rooted in solid MLOps discipline.

MLOps Tip #1: Register Models, Prompts, Decoding Parameters

This process requires adaptation for LLMs. LLMs and their derivatives (like vision language models, large reasoning models, etc.) are autoregressive and take input context, often in the form of a prompt template. This input context alters the weights of the attention matrix, effectively constructing the circuitry to guide the LLM toward generating the desired output.

Therefore, for AI/ML products built on LLMs, we now need to register both the model and the prompt template, along with any specific decoding parameters used. A change in any of these components should be treated as a new release candidate requiring proper validation.

This means:

  • Model weights, decoding parameters, and prompts co-evolve.
  • The combinations of these artefacts must be versioned, tested, and rolled out with the same discipline we apply to traditional machine learning models, container images, or microservices.

The GPT-4o sycophancy incident illustrates the significant impact of neglecting MLOps discipline, even with what might seem like a “low-risk” prompt change. This isn’t just a matter of ‘prompt engineering’, ‘guardrails’, ‘LLMOps’, ‘AIOps’, ‘AgentOps’, or other hype word of the day. This is a fundamental MLOps discipline failure affecting overall AI Safety.

MLOps Tip #2: Safeguard Operations with Deployment Strategies

The industry has developed several mature deployment strategies for machine learning and AI models. Below is a table showcasing how different approaches could have potentially caught this error and minimized the impact, instead of deploying a sycophantic AI to over 180 million monthly active users (MAU).

Pattern What it Does Would it Have Helped Here?
Shadow Deployment New prompt receives a copy of real traffic but never exposes responses to users; logs are compared offline. Likely would have highlighted the excessive-praise distribution shift before user exposure.
Canary Deployment Serve new prompt to 1–5 % of real users, monitor live metrics, auto-rollback on anomalies. Would have limited the blast radius to a tiny cohort instead of 180M monthly active users.
A/B / Online Eval Split traffic, track long-term user-satisfaction, retention, abuse flags. Key metric like “creepiness” or “sincerity” would have trended negative over several sessions, triggering rollback sooner.

MLOps Tip #3: Retrospective, from Outside In

Here’s a breakdown of where the process likely failed, viewed through an MLOps lens:

  1. Metrics Myopia: OpenAI seemingly optimized for near-term engagement signals (like thumbs-up) instead of longitudinal user satisfaction or qualitative metrics. This allowed sycophancy, which might garner initial positive reactions, to look “good” in offline or early online dashboards, masking the negative long-term impact. This mirrors issues seen in other domains, like social media platforms over-optimizing for click-through rates.
  2. Insufficient Progressive Rollout: The widespread and immediate nature of the change suggests a lack of a sufficiently cautious staged rollout (like a canary release). A full, immediate rollout meant social media backlash effectively became the primary alerting system, rather than internal monitoring.
  3. Prompt Not Treated as a First-Class Artefact: We speculate that system prompts might not have been fully integrated into either the primary software development lifecycle (SDLC) or the model development lifecycle (MDLC) pipelines. This could mean prompt changes might circumvent automated testing suites, AI safety checks, and mandatory manual approvals required for code or model updates.

Takeaways

  1. Register Prompts as Critical Artifacts: Treat prompts with the same rigor as models and decoding parameters. Anything that influences model behavior must be versioned, tested, and tracked as a deployable artefact within your MLOps/LLMOps framework.
  2. Progressive Releases: Shadow, canary release, and A/B testing should be the default ode of release, not optional or afterthought. This is fundamental to operational stability and AI Safety.
  3. Optimize Metrics for the Right Horizon. Ensure your evaluation metrics capture long-term user value and safety, not just immediate engagement. This applies to prompt tuning, fine-tuning, and reinforcement learning (RLHF). Reward models should weigh session-level and longer-horizon feedback, not just the first response.
  4. Human Feedback IS NOT Ground Truth, Human feedbacks tend to be very noisy labels and cannot be used as ground truth. Validate the hujan feedbacks with orthongonal evaluations such as red teaming and safety.

Conclusion

The GPT-4o sycophany incident is a reminder that operational rigor keeps LLM features trustworthy at scale. When a friendly tweak can accidentally flatter a hundred millions users into discomfort, MLOps discipline becomes mission critical. Treat prompts as model, deploy progressively, and let metrics determine when to roll back.

References

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {When Prompt Deployment Goes Wrong: MLOps Lessons from ChatGPT’s 'Sycophantic' Rollback},
    year = {2025},
    month = {04},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2025/04/30/ai-ml-llm-ops/}
}