Scrutinizing LLM Reasoning Models

Reasoning models have been a part of AI systems since the field’s earliest years in the mid-1950s. Initially, the term referred to the implementation of preprogrammed, rule-based functions that led to reliable outputs, with little room for generalization and flexibility. Today, it describes large language models (LLMs) that formulate responses in an open-ended manner, showing the steps behind their inference process along the way.

The recent wave of LLMs branded as “reasoning models”—including but not limited to OpenAI’s o1 and o3-mini, DeepSeek’s r1-zero and r1, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini Flash Thinking, and IBM’s Granite 3.2—use the same neural network architecture as standard LLMs, but unlike their predecessors, they’re designed to display the steps associated with an intermediate phase formally known as “reasoning traces” or “Chains of Thought” (CoT). In effect, reasoning models are LLMs that show their work as they reply to user prompts, just as a student would on a math test.

*ChatGPT shows its Chain-of-Thought before giving the final answer to a prompt.*

CoT are valuable for monitoring and safety purposes. They are meant to serve as windows into reasoning models’ internal operations, allowing users to determine whether final answers derive from logical and contextually accurate inferential processes. In the case of incorrect answers, CoT can indicate where inference went wrong. This value disappears, however, if models hallucinate their CoT or fail to disclose steps that were relevant to the final answer.

Such discrepancies are common, according to a paper from Anthropic’s Alignment Science team, “Reasoning Models Don’t Always Say What They Think.” On average, Claude 3.7 Sonnet’s CoT accurately reflected its inner workings 25% of the time, while DeepSeek R1’s were accurate 39% of the time. In experiments focusing on unethical behavior, Claude’s responses were accurate 20% of the time and R1’s were accurate 29% of the time. This means CoT can’t be treated as reliable representations of reasoning model functionality, at least for now.

In a summary of the research paper, Anthropic writes that “substantial work” remains ahead when it comes to improving CoT quality and answer accuracy. While these early investigations might be considered agenda-setting, some question the premises behind the goal of improving reasoning models via CoT monitoring.

In “(How) Do Reasoning Models Reason?,” researchers at Arizona State University write that the use of words like “reasoning” to describe AI inference derives from the imprecise (albeit deeply entrenched) analogy between AI systems and human minds. Anthropomorphic terms like “reasoning” and “faithfulness” give credence to the notion that transparent and human-legible AI inference is possible, which remains an open question. Further, this nomenclature has ethical implications: in the researchers’ words, “Deliberately making [CoT] appear more human-like is dangerous, potentially exploiting the cognitive flaws of users to convince them of the validity of incorrect answers.” The problem isn’t that reasoning models are deliberately deceiving users, they suggest, but that the rhetoric surrounding them creates an exaggerated sense of their accuracy and reliability.

Evaluating CoT

While debates over the ethics of AI nomenclature continue apace, interest in improving reasoning models is spreading across academia and industry. The first step is to develop dependable methods for determining whether CoT are truthful to models’ internal processes or not, which isn’t a straightforward task.

As Anthropic’s Alignment Science team points out in its paper, it’s impossible to directly observe what goes on inside models as they process user prompts. To circumvent this issue, they created test conditions where models were highly likely to incorporate “hint” inputs and then assessed CoT to see if the models “admitted” to using the hints. Specifically, they constructed prompt pairs comprising a baseline or “unhinted” prompt alongside a “hinted” prompt. The two prompts were identical except for the hint, so in cases where the model’s output changed between prompts to reflect the hint, researchers concluded the models incorporated the hint. Although models used the hints at a consistent rate, they rarely revealed hints in their CoT. In most settings, hint disclosure rate fell under 20%.

This experimental approach isn’t the only way to evaluate CoT quality, said IBM Fellow Kush Varshney, a research staff member and manager at the IBM Thomas J. Watson Research Center in Yorktown Heights, N.Y. Varshney explained that researchers can have the model reason through the same prompt multiple times to see if the resulting CoT follow roughly the same sequence. If there’s little consistency between the CoT, that is a sign that they’re not linked to the inference process. Another method is to look for signs of confused reasoning. “If any of the steps aren’t logically entailed in the previous ones,” Varshney said, “that’s an indication that there’s some hallucination going on,” adding that models can learn how to distinguish logical from illogical reasoning by training on synthetic data sets constructed for this purpose. This is the idea behind Granite Guardian, a class of models IBM designed to spot risky discrepancies in LLM prompts and responses.

Improving Reasoning Outcomes

Assessing CoT quality is an important step towards improving reasoning model outcomes. Other efforts attempt to grasp the core cause of reasoning hallucination. One theory suggests the problem starts with how reasoning models are trained. Among other training techniques, LLMs go through multiple rounds of reinforcement learning (RL), a form of machine learning that teaches the difference between desirable and undesirable behavior through a point-based reward system. During the RL process, LLMs learn to accumulate as many positive points as possible, with “good” behavior yielding positive points and “bad” behavior yielding negative points. While RL is used on non-reasoning LLMs, a large amount of it seems to be necessary to incentivize LLMs to produce CoT, which means that reasoning models generally receive more of it. Unfortunately, RL can encourage models to maximize points without meeting intended outcomes, a problematic form of behavior known as “reward hacking.”

Varshney explained that reward hacking is common among LLMs. “AI systems are more than happy to take shortcuts and get points or rewards in ways that the humans designing the system didn’t intend,” he said, citing a well-known example involving a video game called CoastRunners. In 2016, researchers at OpenAI used CoastRunners to test RL outcomes and found that models were able to gain high scores without completing the intended task, which was to finish a boat-racing course. This is similar to what reward hacking looks like in today’s reasoning models: during the RL phase, models learn that CoT should be lengthy, which means that they can earn points by optimizing for length at the expense of logic, legibility, and fidelity to their functionality. This may explain why some CoT appear to be unrelated to their final answer: instead of proceeding towards the answer in a direct, efficient manner, models take a byzantine path that has little to do with the target. This creates verbose CoT that aren’t necessarily appropriate for the task designated in the prompt.

If optimizing for CoT length leads to confused reasoning or inaccurate answers, it might be better to incentivize models to produce shorter CoT. This is the intuition that inspired researchers at Wand AI to see what would happen if they used RL to encourage conciseness and directness rather than verbosity. Across multiple experiments conducted in early 2025, Wand AI’s team discovered a “natural correlation” between CoT brevity and answer accuracy, challenging the widely held notion that the additional time and compute required to create long CoT leads to better reasoning outcomes. This finding remained consistent across different types of problems, said Kartik Talamadupula, Wand AI’s head of AI. “If you can reward the model for being more concise, you can show a significant increase in accuracy across domains,” he said.

A Field of Possibilities

Wand AI’s approach to improving reasoning outcomes is just one among many currently being investigated by scientists in academia and industry. Varshney’s team at IBM recently proposed a novel reasoning architecture called MemReasoner that works through an external memory module. This solution-oriented intervention imbues models with greater memory capabilities, which correlates with greater answer accuracy and capacity for generalization. Other approaches seek to maximize the amount of time models spend on inference and verify responses from an external knowledge source.

Research into reasoning models remains broad and diverse. When it comes to improving CoT and final answers, “there’s always going to be a lot of different ideas to try out,” says Varshney. “It’s just a matter of focusing on the problem and working through solutions.”

Note: The author would like to thank Kush Varshney, Kartik Talamadupula, and Nick Mattei for providing feedback on this article.

Emma Stamm is a writer, researcher, and educator based in New York City. Her interests are in philosophy and the social study of science and technology.