OpenAI released the long-awaited GPT-o1 preview on September 12, 2024. The model was previously known under the codename Q* in 2023, which was later superseded by Project Strawberry in 2024. In this first installment of the Reasoning series, we aim to separate rumors from facts about how the GPT-o1 model works and validate our conjectures through experiments. This will help users better understand and utilize GPT-o1.

Reasoning

In Thinking, Fast and Slow, Daniel Kahneman defined System 1 as the automatic, intuitive mode of thinking, and System 2 as the slower, more analytical mode. In the context of autoregressive language models, the usual inference process is akin to System 1—models generate answers directly.

Reasoning, however, gives models the ability to perform System 2 thinking by introducing “reasoning tokens.” This is similar to the Zero-Shot Chain of Thought (CoT@0) approach (Kojima 2022), where we prompt a model to “think step by step” before answering a question. The o1 model formalizes this concept by providing a “scratchpad” (Nye 2021), allowing models to reason more effectively, especially for tasks that benefit from more thinking time, measured in reasoning tokens.
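A minimal sketch of the Zero-Shot CoT idea using the OpenAI Python SDK (the model name, question, and prompt wording here are illustrative, not OpenAI’s):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# System-1 style: ask for the answer directly.
direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question + " Answer with a number only."}],
)

# System-2 style (Zero-Shot CoT, Kojima 2022): elicit intermediate reasoning first.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question + " Let's think step by step."}],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```

The o1 models internalize the second pattern: the “step by step” portion happens in hidden reasoning tokens rather than in the prompt.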


For straightforward questions, like “What’s the capital of France?”, additional reasoning time isn’t needed. But for complex tasks, such as coding or math problems, reasoning makes a significant difference. For example, DeepMind’s AlphaGo benefited from deep thinking time through Monte Carlo Tree Search, which was especially useful for its adversarial Go matches. Similarly, GPT-o1 shines in tasks requiring structured problem-solving, while its performance on open-ended tasks (like writing) might not differ much from GPT-4o.

In other words, instead of simply “speaking” directly like GPT-4o, the GPT-o1 model “thinks” before it “speaks.”

To keep the internal reasoning process from being exposed, reasoning steps are hidden from users. On ChatGPT, only the titles of reasoning steps are streamed, and their contents are summarized. For API calls, the reasoning steps are not visible at all; only the count of reasoning_tokens is provided at the end of an inference.
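A short sketch of what this looks like from the API side (the field names follow OpenAI’s published response schema at the time of writing; the prompt is illustrative):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
)

usage = response.usage
# The hidden reasoning is never returned; only its token count is reported.
print("completion tokens:", usage.completion_tokens)
print("reasoning tokens:", usage.completion_tokens_details.reasoning_tokens)
```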

The internal reasoning capabilities also make the model respond poorly to explicit chain-of-thought prompts like “think step by step” or “explain your reasoning.” In a simple experiment, we asked GPT-o1: “Which is larger, 9.11 or 9.8? Answer only from 9.11 or 9.8.” While other language models often get this wrong, GPT-o1 answered correctly around 80% of the time—as long as it had enough reasoning tokens. However, when explicitly asked to “think step by step,” accuracy dropped to 20%.
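A rough sketch of how such an experiment can be run (the sample size, answer parsing, and exact prompt phrasing are our choices, not OpenAI’s):

```python
from openai import OpenAI

client = OpenAI()

PROMPT_PLAIN = "Which is larger, 9.11 or 9.8? Answer only from 9.11 or 9.8."
PROMPT_COT = PROMPT_PLAIN + " Think step by step."

def accuracy(prompt: str, n: int = 20) -> float:
    """Ask the same question n times and count how often the answer is 9.8."""
    correct = 0
    for _ in range(n):
        response = client.chat.completions.create(
            model="o1-preview",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content.strip()
        # Crude parse: count as correct only if 9.8 appears and 9.11 does not.
        correct += "9.8" in answer and "9.11" not in answer
    return correct / n

print("plain prompt:", accuracy(PROMPT_PLAIN))
print("explicit CoT prompt:", accuracy(PROMPT_COT))
```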


It turns out that just like humans, smart reasoning models don’t like being micromanaged. And as always, irrelevant information in retrieval-augmented generation setups can hurt reasoning performance.

What Is GPT-o1?

Before the release of GPT-o1-preview, speculation about how OpenAI implemented the reasoning models was rampant. People guessed about Tree of Thought (ToT, Yao 2023), Graph of Thought (GoT, Besta 2023), MCTS, and other search algorithms. There were even wilder ideas about continuous learning or graph-based information retrieval — a mess of research paper titles repurposed by ChatGPT and passed around the AI influencer community.

Now that GPT-o1-preview is available, both on ChatGPT and via API, and with an official model card report, it’s time to clarify what GPT-o1 really is.

Known Facts

GPT-o1 is a Single Model

During an OpenAI AMA on Twitter, a researcher confirmed that GPT-o1 is a single model, not a system or framework.
Source: Twitter

No Inference-Time Search or Sampling for Benchmarks

The o1 model has been evaluated on tasks like Competition Math (AIME 2024) and Competition Code (CodeForces) using cons@64 (majority voting over 64 model calls) and pass@1 (success rate in a single call). These metrics imply that there is no inference-time search—the model doesn’t employ strategies like Monte Carlo Tree Search or Graph of Thought (Besta 2023) during reasoning. If it did, pass@1 scores should be at least equivalent to, if not exceed, cons@64, since objective-oriented search should work better than repeated sampling.
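For clarity, both metrics are computed from independent samples with no search involved. A hedged sketch of the two metrics (the answer data here is placeholder, not a real benchmark result):

```python
from collections import Counter

def pass_at_1(samples: list[str], reference: str) -> float:
    """pass@1: fraction of single, independent calls that are correct."""
    return sum(s == reference for s in samples) / len(samples)

def cons_at_k(samples: list[str], reference: str, k: int = 64) -> bool:
    """cons@k: take the majority vote over k independent samples."""
    majority, _ = Counter(samples[:k]).most_common(1)[0]
    return majority == reference

# e.g. 64 independent answers to one competition math problem (placeholder data)
answers = ["204"] * 40 + ["205"] * 24
print(pass_at_1(answers, "204"))       # 0.625
print(cons_at_k(answers, "204", 64))   # True
```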


Reasoning Tokens Use the Context Window

Reasoning tokens count toward the context window, reducing the budget available for prompt and output tokens. The o1-preview and o1-mini models cap output (reasoning plus visible completion) at 32,768 and 65,536 tokens respectively, indicating that reasoning steps are stored within the context window. However, this doesn’t necessarily rule out parallel reasoning.
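In practice this means a single budget covers both the hidden reasoning and the visible answer. A minimal sketch using the max_completion_tokens parameter (the budget value and prompt are arbitrary):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    # Caps reasoning tokens + visible output tokens together; if reasoning
    # exhausts the budget, the visible answer may come back truncated or empty.
    max_completion_tokens=4096,
)

details = response.usage.completion_tokens_details
print("reasoning tokens:", details.reasoning_tokens)
print("visible output tokens:", response.usage.completion_tokens - details.reasoning_tokens)
```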

Chain of Thought Examples Do Not Indicate Search or Backtracking

OpenAI’s research blog on Learning to Reason with LLMs and the o1 System Card provide some full Chain of Thought examples. These don’t appear to involve search or backtracking — though the model may shift perspectives or re-evaluate decisions, it all happens in a forward, sequential decision-making manner. This could be an artifact of OpenAI selectively disclosing its implementation details.


Reasoning Step Titles Streamed with Summarized Content

On ChatGPT, reasoning step titles are visible while the content is dynamically summarized. So far, these titles don’t appear to be modified, suggesting limited backtracking in reasoning.

Educated Guesses

Linear Relationship Between Reasoning Tokens and Inference Time

We plotted the relationship between reasoning tokens and inference time using the OpenAI o1 API. As previously mentioned, the API does not expose the reasoning tokens themselves, but it does report the number of reasoning_tokens for billing purposes. To minimize the impact of pre-fill (prompt) tokens and maximize the impact of reasoning tokens on latency, we used the earlier example prompt—“What’s larger? 9.11 or 9.8? Answer only from 9.11 or 9.8.”—which keeps both inputs and outputs short.

If any search algorithm such as MCTS, ToT, or GoT were used, and OpenAI counted those tokens for billing purposes, the relationship between reasoning tokens and latency would be sub-linear—we would consume many more tokens without a proportional increase in latency. The results are plotted below, and the relationship is fairly linear, indicating that no search is used at inference time.
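A sketch of how such a measurement can be made and checked for linearity (the model name, sample count, and use of a least-squares fit are our choices for this experiment):

```python
import time

import numpy as np
from openai import OpenAI

client = OpenAI()
PROMPT = "What's larger? 9.11 or 9.8? Answer only from 9.11 or 9.8."

tokens, latencies = [], []
for _ in range(50):
    start = time.monotonic()
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": PROMPT}],
    )
    latencies.append(time.monotonic() - start)
    tokens.append(response.usage.completion_tokens_details.reasoning_tokens)

# Least-squares fit: latency ≈ slope * reasoning_tokens + intercept.
slope, intercept = np.polyfit(tokens, latencies, 1)
r = np.corrcoef(tokens, latencies)[0, 1]
print(f"latency ≈ {slope:.4f} s/token * tokens + {intercept:.2f} s, r = {r:.3f}")
```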

[Figure: reasoning tokens vs. inference latency]

However, we cannot rule out the possibility that OpenAI is up-charging reasoning tokens while intentionally excluding the tokens consumed by an inference-time search system. This seems unlikely, as scaling laws favor training models to reason rather than programming models to search for reasoning.

Conclusion

This covers the initial release and our understanding of GPT-o1. In future parts of this series, we’ll delve deeper into the topic of reasoning and attempt to replicate the functionality of the o1 model.

References and Acknowledgements

@misc{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Reasoning Series, Part 1: Understanding GPT-o1},
    year = {2024},
    month = {10},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2024/10/08/reasoning-understanding-o1/}
}