
I feel we've skipped an important step in AI literacy. It happened so quickly, after all. Generative AI (GenAI) models are "reasoning", and AI literacy is still being discussed as if we're dealing with the same single-pass AI of a year ago. But those of you who have used AI's "reasoning" features, or related capabilities like "deep research" modes that generate full research reports, know that we've passed a key threshold. No longer do people need to craft long, complicated prompts for many AI tasks. Not that prompts aren't needed, but the nature of the prompts is now different.
We've reached the point where so much is happening under the hood of "reasoning" models that, for educational purposes, it's no longer AI 101 material. Students and teachers will benefit from using models with such features turned off before graduating to AI models with more advanced cognitive features.
You may already be upset about my using terms like "reasoning" and "cognitive" to refer to GenAI. I'm barreling through that concern because this article describes what's really happening in AIs with such features, often in contrast to human abilities. I'm not sure those are the right terms either, but I don't have better ones; all our thinking-related terms were created to refer to humans. For those reasons, I'm ditching the quotes in the rest of this article. The comparisons I make to human cognition here are meant to draw parallels at abstract conceptual levels, not to imply equivalence or completeness compared to human brains.
The step too often skipped in AI literacy instruction and in burgeoning curricula is describing what's really going on when a model says it's reasoning or contemplating, what can go wrong, and how to shape the reasoning process.
It emerged very quickly. For a while there, many treated the AI progression narrative as settled. Pundits claimed that the transformer models underpinning all AI language models (until DeepSeek) were facing rapidly diminishing returns. They argued that simply scaling up—adding more data and more computing power—wouldn't produce another significant leap.
But people don't account for innovation very well, and they were wrong. To those in the AI world, the notion that the technology had hit an insurmountable roadblock was absurd. The number of untried and potentially pivotal options for architecting and training GenAI was enormous. If you were paying attention to Silicon Valley CEOs, they were all pointing to the need for more data and compute, but that framing served many purposes for them. In AI science circles, it was always clear the innovation train wasn't slowing down. It still isn't.
The jump to reasoning models wasn't just another step on the scaling ladder; it was a leap to a different kind of improvement path. These systems began to solve complex problems that required multiple, coherent logical steps. It was the emergence of new processing strategies, and not the last of them.
But what's really going on, and what does it mean for AI literacy skills?
AI's System 1 and System 2
To understand this leap, and to have any hope of teaching students how to actually use these tools, we have to stop seeing AI as a monolithic oracle. Instead, we need to understand its underlying cognitive architecture, which is best explained by an analogy to psychologist Daniel Kahneman's depiction of the brain: System 1 and System 2 thinking. AI now has both to some degree, and learning to manage the interplay between them is a critical skill.
The part of GenAI most people are familiar with is analogous to System 1, the fast thinker. This is the intuitive, associative inference engine that was the only aspect of language models until mid-2024. In the brain, our intuitive and mainly unconscious processing does the lion's share of the cognitive work. Even things we think are decided consciously are often just reported to our consciousness; the decision was already made subconsciously. While "pattern finding" is used disparagingly in reference to GenAI, that's what our unconscious does too, functionally if not necessarily by similar methods.
System 2 is the slow, deliberative, conscious processing in our brains that we associate with reasoning and other higher-order cognition. Analogously, AI's reasoning is a deliberate and iterative inner voice: a self-assessment engine that is analytical and critical of the model's own output. It is, in short, the system that provides judgment. It's the part of the AI that allows it to "think about its own thinking," catching the very mistakes its faster counterpart makes.
The slow thinker can catch logical fallacies, identify inconsistencies, and force the AI to backtrack from dead ends. In advanced implementations like Tree of Thoughts, it evaluates multiple reasoning paths simultaneously, pruning weak branches and developing only the most promising ideas. It's the difference between brainstorming and conducting a methodical investigation.
The big difference between our brain's System 1 and AI's is the data each learns from and processes. GenAI is trained on humanity's System 2-generated text, so its intuition can be about reasoning, explanation, creativity, and argumentation as they manifest textually (or now in data, audio, or video forms too). Oh, and I'm not talking about either AI's System 1 or System 2 having a form of consciousness. Consciousness is central to our experience, and we tend to think it's critical for intelligence. But it's not necessary or central to the conceptual overlaps between human and AI cognition.
AI Reasoning Methods
Unlike the common transformer architecture that underpinned most prior LLMs, every AI product has its own secret sauce when it comes to reasoning. But there are two main areas of improvement.
One area of improvement is the fast inference engine, the intuitive "System 1" part of GenAI. These enhancements can take multiple forms:
Fine-tuning for reasoning criteria: After the AI is originally trained to predict the next text token, it is "fine-tuned"—trained again from where the original training left off. An early and widely used form of this is Reinforcement Learning from Human Feedback (RLHF), in which people provide the training data for fine-tuning, usually by indicating which AI outputs they like best. But there are other possible fine-tuning criteria. In reasoning systems, the fine-tuning can have the AI learn from high-quality reasoning examples.
Ensemble methods: The System 1 inference engine can also be enhanced by running multiple models in parallel and combining the results in some way, either by taking the most common answer (more likely to be correct for many challenges) or, in DeepSeek's case, by combining multiple inference models, each trained to have different expertise ("Mixture of Experts" models).
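To make the majority-vote version of this concrete, here is a minimal Python sketch. It assumes a hypothetical ask_model(model_name, prompt) helper standing in for whatever model API you use; nothing here is a specific vendor's interface.

from collections import Counter

def ask_model(model_name: str, prompt: str) -> str:
    # Hypothetical helper: send the prompt to the named model and
    # return its answer as a plain string. Stubbed out here.
    raise NotImplementedError("wire this up to your model API of choice")

def ensemble_answer(prompt: str, models: list[str]) -> str:
    # Ask several models the same question and return the most common
    # answer: a simple majority vote across the ensemble.
    answers = [ask_model(m, prompt) for m in models]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Usage sketch: three hypothetical models vote on one question.
# ensemble_answer("What is 17 * 24?", ["model-a", "model-b", "model-c"])

The vote only helps when the models fail in different ways; if they all share the same blind spot, the consensus is confidently wrong.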
Because AI's System 1 lacks an integrated mechanism for deliberate review, its impressive fluency can mask factual errors, logical fallacies, and what we've all come to know as "hallucinations." It produces a beautifully written but perhaps fundamentally flawed argument because it doesn't have a built-in method for distinguishing a brilliant insight from a plausible-sounding fabrication. It is all gas, no brakes. Left unguided, it will confidently drive off a cliff. It's learned what great reasoning looks like in its final form, but it hasn't experienced the messy, doubt-filled process of actually getting there.
But the biggest enhancements came from adding a System 2 to AI architectures in the form of various self-assessments. These pieces are separate AIs that interact with the output from the main System 1 inference engine and provide feedback to it. Prior to these innovations, the only feedback in the core AI model was that it could read what it had already written. Now there are other AI models that advise it on approach and strategy too.
The reasoning system methods mostly fall into a few categories:
Chain of Thought (CoT): The AI is prompted to "show its work" by articulating each step of its reasoning process. Instead of jumping straight to an answer, the model breaks complex problems into sequential logical steps, much like a student solving a math problem on paper. This dramatically improves accuracy on multi-step reasoning tasks—but the quality depends entirely on the prompt design and the AI's ability to avoid plausible-sounding but incorrect reasoning paths.
Self-Consistency Decoding: Rather than accepting the first reasoning chain the AI produces, this approach generates multiple different solution paths for the same problem and selects the answer that appears most frequently. Think of it as consulting several independent experts and going with the consensus. Research shows this can boost performance on reasoning tasks, but it comes with significantly higher computational costs.
Tree of Thoughts (ToT): This moves beyond linear reasoning to explore multiple solution branches simultaneously, like a chess player considering several moves ahead. The AI evaluates different reasoning paths, prunes weak branches, and can backtrack when it hits dead ends. This systematic exploration is particularly powerful for complex problems that benefit from strategic lookahead, though it demands substantial computational resources.
Critic and Constitutional Models: Separate AI systems are trained specifically to evaluate and score the reasoning quality of the main model's outputs. These "critic" models learn to identify logical fallacies, factual errors, and inconsistencies, functioning as an internal quality control system. This represents a shift from single-model reasoning to multi-model collaboration, where different AIs specialize in generation versus evaluation.
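To ground the critic idea, here is a minimal sketch of a generate-critique-revise loop, assuming hypothetical generate() and critique() model calls (not any particular vendor's API). A main model answers step by step; a separate critic scores the answer and suggests a fix, and the loop revises until the critic is satisfied or a retry budget runs out.

def generate(prompt: str) -> str:
    # Hypothetical call to the main "System 1" model: returns a
    # step-by-step answer to the prompt. Stubbed out here.
    raise NotImplementedError

def critique(question: str, answer: str) -> tuple[float, str]:
    # Hypothetical call to a separate critic model: returns a quality
    # score between 0 and 1 plus a short note on what to fix.
    raise NotImplementedError

def answer_with_critic(question: str, max_rounds: int = 3, threshold: float = 0.8) -> str:
    # Generate, critique, and revise until the critic is satisfied
    # or the revision budget runs out.
    prompt = f"Answer step by step: {question}"
    answer = generate(prompt)
    for _ in range(max_rounds):
        score, note = critique(question, answer)
        if score >= threshold:
            break
        # Feed the critic's note back to the generator and try again.
        prompt = (f"Question: {question}\n"
                  f"Previous answer: {answer}\n"
                  f"A reviewer said: {note}\n"
                  f"Revise the answer, step by step.")
        answer = generate(prompt)
    return answer

Production systems are far more elaborate, but the shape is the same: generation and evaluation are handled by different models, and the evaluation feeds back into the next attempt.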
The most sophisticated GenAIs might use several of these techniques. Undoubtedly AI's reasoning pieces hallucinate too, but since the patterns they are finding and applying are so abstract, we may not notice unless the hallucination mushrooms into a dysfunctional reasoning path.
The training process for these reasoning models reveals something crucial about how AI development has evolved. Unlike the early days when humans scored AI outputs to guide training, reasoning models are increasingly trained on vast datasets of AI-generated reasoning chains. The sheer volume makes human evaluation impractical—you can't hire enough humans to score millions of step-by-step reasoning attempts across complex mathematical proofs, logical puzzles, and multi-step analyses. So AI systems learn to reason by studying the reasoning patterns of other AI systems, with automated scoring mechanisms determining which reasoning chains are "good" and which are "bad." This creates a recursive feedback loop where AI reasoning capabilities emerge from the collective patterns of artificial reasoning attempts, not human judgment.

The obvious risk is that these systems might learn to mimic the appearance of good reasoning—using phrases like "let me think step by step" and "upon further reflection"—without actually developing robust logical capabilities. We're essentially training AI to recognize what AI reasoning looks like, which can produce sophisticated-sounding outputs that mask fundamental logical errors.
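Here is a rough sketch of that automated scoring, under the assumption that the training problems have known final answers (math questions, for instance). The sample_reasoning_chain helper is hypothetical; the point is that a checker, not a human rater, decides which chains are kept as fine-tuning data.

def sample_reasoning_chain(question: str) -> tuple[str, str]:
    # Hypothetical model call: returns (chain_of_thought, final_answer).
    raise NotImplementedError

def build_reasoning_training_set(problems: list[tuple[str, str]],
                                 samples_per_problem: int = 8) -> list[dict]:
    # Keep only the reasoning chains whose final answer matches the
    # known answer: an automated filter standing in for human scoring.
    kept = []
    for question, known_answer in problems:
        for _ in range(samples_per_problem):
            chain, final_answer = sample_reasoning_chain(question)
            if final_answer.strip() == known_answer.strip():
                kept.append({"prompt": question, "target": chain})
    return kept

Note the caveat from the paragraph above: a chain can land on the right answer through flawed steps, and this filter would keep it anyway.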
The New Scaling Law: Test-Time Compute
These reasoning capabilities represent a fundamental shift in how AI systems allocate computational resources. Traditional AI followed a simple scaling law: train bigger models on more data to get better performance. But reasoning models have introduced a new scaling dimension: test-time compute.
Test-time compute means the AI can spend more processing power while you're using it—when it's actually answering your question—rather than just during training. Instead of giving you the first answer that comes to mind, reasoning models can deliberate, explore multiple solution paths, and self-evaluate their work before responding. This is why you might notice reasoning models taking longer to respond, especially for complex questions.
The performance gains have been dramatic. Models using reasoning approaches have shown breakthrough improvements on challenging benchmarks—solving complex math problems, writing more coherent long-form content, and handling multi-step logical puzzles that would trip up even large traditional models. In some cases, a smaller reasoning model outperforms a much larger traditional model by spending more computational effort on the thinking process itself.
But AI companies are discovering unintended consequences of how they engineer these systems. Some reasoning models become overly verbose, showing excessive "thoughts" that don't actually improve the final answer. Others develop reasoning patterns that are hard to predict or control. There's also the risk of reasoning models that sound more confident and sophisticated but are actually making more subtle errors that are harder to detect.
The computational demands are substantial. A reasoning model might use 10-100 times more compute to answer a difficult question compared to a quick response from a traditional model. Understanding when these capabilities are needed—and when they're overkill—becomes a critical skill for both educators and students.
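As a back-of-the-envelope illustration of that gap (every number below is invented for the example, not a measured benchmark), the cost difference compounds quickly:

# Toy comparison of per-response cost, using made-up numbers purely for illustration.
PRICE_PER_1K_TOKENS = 0.01              # assumed price, in dollars

quick_response_tokens = 300              # a short, single-pass answer
reasoning_tokens = 300 * 40              # hidden deliberation plus the answer (assumed 40x)

quick_cost = quick_response_tokens / 1000 * PRICE_PER_1K_TOKENS
reasoning_cost = reasoning_tokens / 1000 * PRICE_PER_1K_TOKENS

print(f"quick answer:     ~${quick_cost:.3f}")      # ~$0.003
print(f"reasoning answer: ~${reasoning_cost:.3f}")  # ~$0.120

Multiply that across a classroom or a school district, and knowing when deliberation is worth paying for stops being an abstract question.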
In Part 2 of this series, I'll explore how to translate this understanding into practical classroom skills—teaching students to manage AI reasoning processes. Part 3 will dive deeper into the fundamental pattern recognition that underlies even these System 2 aspects of cognition, addressing the common dismissal of AI as "just pattern matching."
©2025 Dasey Consulting LLC. All Rights Reserved.