In our previous post, we defined “reasoning” as System 2 thinking, inspired by the book Thinking, Fast and Slow. Reasoning is crucial for enhancing the performance of large language models (LLMs) in complex scenarios and elevating their overall intelligence.

In this post, we refine our definition of reasoning for LLMs and explore how this relates to scaling their intelligence during inference time. In our next post, we will provide an overview of four ways of implementing reasoning in LLMs and compound AI systems.

It’s important to note that we’re still in the early stages of this journey, and the landscape is constantly evolving. However, the general principles here are likely to endure.

Reasoning

LLMs excel at generating human-like text because they are trained on vast amounts of written data. However, the reasoning process that humans use to write text is implicit and unrecorded, which means LLMs generally lack a well-defined reasoning process in their training.

Generating Rationales

In 2017, Ling et al discovered that having language models generate rationales before providing an answer significantly improved their performance. As demonstrated in the example below, the language model was trained on rationale-annotated data, so it learns to generate a rationale before producing the final answer. The rationale defines the reasoning trajectory that guides the model toward that answer. This work used Long Short-Term Memory (LSTM) language models and predated the widespread use of transformers and LLMs. Despite the simplicity of asking models to produce rationales first, the results were notably better.

[Figure: example of a rationale generated before the final answer, from Ling et al, 2017]
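To make this concrete, a rationale-annotated training example in the spirit of Ling et al might look like the following sketch. The question, rationale, and answer here are invented for illustration and are not taken from the original dataset.

```python
# A hypothetical rationale-annotated example (illustrative only, not from the original dataset).
example = {
    "question": "A store sells pens at 3 for $2. How much do 12 pens cost?",
    "rationale": "12 pens is 12 / 3 = 4 groups of 3 pens. Each group costs $2, so 4 * 2 = $8.",
    "answer": "$8",
}

# Training target: the model learns to emit the rationale first, then the final answer.
target_text = f"Rationale: {example['rationale']}\nAnswer: {example['answer']}"
print(target_text)
```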

Scratchpads

Building on this, Nye et al, 2021 introduced the idea of a scratchpad. Instead of training language models on rationale-annotated data, they trained a transformer-based, decoder-only language model to first emit its intermediate work between <scratch> and </scratch> tags before producing the final answer. This was an early precursor of structured prompting: such tags are now commonplace in 2024 and are recommended by Anthropic for structuring prompts.

[Figure: a scratchpad example with intermediate steps between <scratch> and </scratch> tags, from Nye et al, 2021]
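As a rough sketch of the format, a scratchpad-style target places intermediate work between the tags and the answer after them. The arithmetic task and exact layout below are illustrative, not copied from the paper.

```python
# Illustrative scratchpad-style prompt/target pair: intermediate work goes between
# <scratch> tags, and the final answer follows. (Format sketched from the paper's
# description, not reproduced verbatim.)
prompt = "Compute: 29 + 57"
target = (
    "<scratch>\n"
    "29 + 57: add ones digits 9 + 7 = 16, write 6, carry 1.\n"
    "Add tens digits 2 + 5 + 1 = 8.\n"
    "</scratch>\n"
    "86"
)
print(prompt + "\n" + target)
```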

Chain of Thought

In 2022, chain-of-thought prompting emerged with Wei et al, 2022 and Kojima et al, 2022. With chain-of-thought prompting, models are instructed to “think step by step” before answering. Remarkably, these techniques improved model performance, illustrating the benefit of guiding models to generate longer, more thoughtful responses.

[Figure: chain-of-thought prompting example, from Wei et al, 2022]
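A minimal zero-shot chain-of-thought prompt, in the style of Kojima et al, simply appends a trigger phrase before sampling. The `generate` function referenced in the comment is a stand-in for whatever LLM API you use, not a specific library call.

```python
def chain_of_thought_prompt(question: str) -> str:
    # Zero-shot CoT: append the trigger phrase so the model produces
    # intermediate reasoning before the final answer.
    return f"Q: {question}\nA: Let's think step by step."

# Hypothetical usage with a placeholder LLM call:
# answer = generate(chain_of_thought_prompt("If a train travels 60 km in 45 minutes, what is its speed in km/h?"))
print(chain_of_thought_prompt("If a train travels 60 km in 45 minutes, what is its speed in km/h?"))
```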

Other methods also aim to enhance reasoning, such as self-consistency (Wang et al, 2022), where multiple answers are sampled and a majority vote over them selects the final one. Techniques like Tree of Thoughts (Yao et al, 2023) expand this idea further by branching reasoning steps and searching over the reasoning process. Often, these methods chain multiple language model calls, a pattern known as compound AI systems. Essentially, they trade additional compute (FLOPs) and tokens for improved results.
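Here is a minimal sketch of self-consistency under those assumptions. The `generate(prompt, temperature)` and `extract_answer(completion)` functions are placeholders for your own model call and answer-parsing logic; the voting step is the core of the technique.

```python
from collections import Counter

def self_consistency(prompt: str, generate, extract_answer, n_samples: int = 10) -> str:
    # Sample multiple reasoning paths at non-zero temperature, then pick the
    # answer that appears most often (majority vote across samples).
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=0.7)  # placeholder LLM call
        answers.append(extract_answer(completion))      # placeholder answer parser
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```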

However, not every problem benefits from extended reasoning. A question like “What is the capital of Silicon Valley?” either has a known answer or it does not, much like a simple query on a search engine such as Bing or Google. On the other hand, tasks that require reasoning, such as writing code, identifying errors in a block of code, editing text, or summarizing information, do benefit from it.

“Reasoning,” in a generalized sense, can be defined as allowing the model to “think longer” before providing an answer.

This approach generalizes well across different models. For instance, AlphaGo uses Monte Carlo Tree Search (MCTS) to find the highest-value move in the search space, taking a predefined amount of time to make its decision. Human chess players, too, work within an allotted thinking-time budget per game and decide how long to spend on each move.

Thus, a model’s ability to reason is reflected in how well its capabilities scale with the allotted inference time or compute — a concept known as inference time scaling or test-time scaling.

Reasoning and Inference Time Scaling

For the past several decades, advances in computing have been driven by Moore’s Law, which states that the number of transistors in an integrated circuit doubles approximately every two years. This trend has enabled rapid growth in compute and data storage and, in turn, fueled advances in neural network models, starting with AlexNet in 2012 and reaching an inflection point with the release of the Generative Pre-trained Transformer (GPT) in 2018.

Scaling Pre-training

The first phase following this inflection point involved scaling up compute to consume the vast amounts of internet data needed to pre-train our models. Scaling up pre-training is a non-trivial parallel computing task, ranging from setting up high-bandwidth interconnects on server rack backplanes and distributing traffic between cages, to building distributed training programs that can tolerate random node failures. As we scaled compute, we observed emergent abilities (Wei et al, 2022) in these models — this was the era of scaling up pre-training.

Scaling Post-training

The next phase began around 2022, with efforts to train models to better understand and follow human instructions. It started with FLAN (Finetuned Language Net) (Wei et al, 2021), followed by InstructGPT (Ouyang et al, 2022). Leading research labs then scaled up post-training, with Google’s Flan-T5 (Chung et al, 2022) and OpenAI’s GPT-3.5 and the release of ChatGPT. Post-training takes a raw pre-trained model (e.g., Llama-3.2-3B) and tunes it to follow human instructions and preferences (e.g., Llama-3.2-3B-Instruct). The process involves supervised instruction fine-tuning and reinforcement learning from human feedback, using curated expert-labeled data as well as model-generated data.
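As a rough illustration, a single supervised fine-tuning record pairs an instruction (and optional input) with a preferred response. The exact schema varies by framework, so the field names below are only an example.

```python
# Illustrative instruction-tuning record (field names vary by framework and dataset).
sft_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are trained on vast amounts of text ...",
    "output": "LLMs learn general language ability by training on large text corpora.",
}
```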

Scaling Inference Time: Reasoning

In the first two phases, we primarily focused on improving models through pre-training and post-training — all of which occur before inference. Now, we are entering a third phase: scaling model intelligence by enhancing reasoning during inference time.

Inference time scaling, also known as test-time scaling, is about allowing models to “think longer” to generate better outputs — much like how AlphaGo deliberates before making a move, or how OpenAI’s o1 model deliberates longer before producing a final answer.
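One simple way to picture inference-time scaling is best-of-N sampling under a compute budget: spend more samples (and tokens), score each candidate, and keep the best one. This is a sketch of the general idea rather than how any particular model works; the `generate` and `score` functions are placeholders for a model call and a verifier or reward model.

```python
def best_of_n(prompt: str, generate, score, n_samples: int) -> str:
    # More samples means more inference-time compute; quality tends to improve
    # as the budget (n_samples) grows, which is the essence of test-time scaling.
    candidates = [generate(prompt, temperature=0.8) for _ in range(n_samples)]  # placeholder LLM calls
    return max(candidates, key=score)                                           # placeholder verifier / reward model
```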

In upcoming posts, we will explore different methods for implementing reasoning and how these approaches impact model performance.

References

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Reasoning Series, Part 2: Reasoning and Inference Time Scaling},
    year = {2024},
    month = {10},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2024/10/21/reasoning-inference-scaling/}
}