Apple Researchers Reveal Fundamental Limitations in AI Reasoning Models

A new study from Apple’s machine learning team challenges prevailing assumptions about the capabilities of advanced AI reasoning systems. Presented in a paper titled "The Illusion of Thinking," the research identifies critical limitations in state-of-the-art Large Reasoning Models (LRMs) such as Claude 3.7 Sonnet Thinking and Gemini Thinking, showing that they struggle with systematic problem-solving beyond basic complexity levels.

The team evaluated frontier LRMs using customizable puzzle environments such as Tower of Hanoi, Checkers Jumping, and River Crossing problems. These settings allowed for precise control over task difficulty and required strict adherence to logical rules rather than relying on pattern recognition. The study revealed three central limitations.

First, all tested models completely failed when puzzle complexity exceeded 15–20 steps. Regardless of the available computational resources, performance dropped to zero percent accuracy at higher difficulty levels, indicating a fundamental constraint in managing multi-step logic.

Second, the models displayed what the researchers called an "overthinking paradox." As problems became more challenging, the solutions generated by the models grew increasingly verbose but less effective. At medium complexity levels, LRMs consumed two to three times more computational resources than standard models, while delivering only modest gains in accuracy.

Finally, the models showed scaling limitations. Despite having sufficient computational budgets, they reduced their reasoning effort beyond certain complexity thresholds, as measured by the number of processing tokens. This behavior suggests inherent limits in how these systems allocate cognitive resources.
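To make the idea of a controllable, rule-checked puzzle environment concrete, here is a minimal Python sketch for Tower of Hanoi. It is an illustration only, not the researchers' evaluation code: the function names `solve_hanoi` and `check_solution` are hypothetical, and using the disk count `n` as the difficulty setting is an assumption made for this example. The checker replays a proposed move list and stops at the first move that breaks a rule, which is the kind of step-by-step verification that final-answer scoring alone would miss.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (source peg, target peg), pegs numbered 0..2


def solve_hanoi(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Return the optimal move list (2**n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, aux, dst)    # clear the n-1 smaller disks
        + [(src, dst)]                       # move the largest disk
        + solve_hanoi(n - 1, aux, dst, src)  # restack the smaller disks
    )


def check_solution(n: int, moves: List[Move]) -> Tuple[bool, int]:
    """Replay moves from the start state.

    Returns (solved, legal_moves): whether all disks end on peg 2, and how
    many moves were applied before the first rule violation (or the end).
    """
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds n..1, largest at bottom
    for i, (src, dst) in enumerate(moves):
        valid_pegs = src in (0, 1, 2) and dst in (0, 1, 2) and src != dst
        if not valid_pegs or not pegs[src]:
            return False, i  # unknown peg, no-op move, or empty source peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1)), len(moves)


if __name__ == "__main__":
    for n in (3, 5, 10):
        moves = solve_hanoi(n)
        print(n, len(moves), check_solution(n, moves))
```

Because the minimum solution length grows as 2^n - 1, raising `n` by one roughly doubles the number of steps a solver must get right, which is what makes a puzzle like this a convenient difficulty dial.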

To further investigate these limitations, the study introduced a novel framework comparing LRMs with standard language models under equivalent computational conditions. At low complexity levels, standard models outperformed LRMs both in terms of accuracy—achieving 85% compared to 78%—and efficiency, using only 1,200 tokens per solution versus 4,500 for LRMs. At medium complexity, LRMs held a moderate advantage, solving 45% of problems compared to 32% for standard models. However, at high complexity, both types of models collapsed to nearly zero accuracy. Interestingly, LRMs often produced shorter and less coherent reasoning traces at these levels than they did when solving simpler problems.

The implications for AI development are significant. The study revealed that models struggled to reliably implement known algorithms such as breadth-first search, even when explicitly prompted. Their reasoning was often inconsistent, with solutions frequently violating basic puzzle rules mid-process, indicating a fragile grasp of logical constraints. Furthermore, while LRMs did exhibit some capacity for detecting errors, they often became trapped in repetitive correction loops instead of devising new strategies for solving problems.

Apple’s researchers urge caution in interpreting current benchmarking results. They argue that what appears as reasoning in LRMs might more accurately be described as constrained pattern completion, which can be effective for routine problems but proves brittle when faced with novel challenges. They emphasize that true reasoning involves the capacity to adapt solution strategies to the complexity of a problem—something current models have not yet demonstrated.

The study underscores the need for new evaluation paradigms that go beyond measuring final-answer accuracy to include analysis of the reasoning process itself. As AI systems are increasingly entrusted with critical decision-making responsibilities, understanding these fundamental limitations becomes essential for the development of reliable and transparent technologies. 
