AI faces "complete accuracy collapse" with complex tasks

AI limitations: Apple research finds “complete accuracy collapse” for LRMs with complex problems

By Matt Ogg

9 June 2025

LRMs struggled with more complex iterations of the Tower of Hanoi puzzle. Photo: House of Marbles.

The limitations of more sophisticated artificial intelligence (AI) have been laid bare by a recent study from Apple (NASDAQ: AAPL), which found the accuracy of Large Reasoning Models (LRMs) from industry leaders like OpenAI, Anthropic, Google and DeepSeek falls to zero on tasks beyond a certain level of complexity.

While it is well known that simpler Large Language Models (LLMs) are non-thinking and pattern-replicating (and prone to hallucinate without sufficient safeguards), researchers from Apple have sought to assess the reasoning capability of more advanced LRMs.

In their paper, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, the Apple researchers investigated whether LRMs are capable of generalisable reasoning, or if they too are just leveraging different forms of pattern matching.

Apple looked at models such as OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking and Gemini Thinking, which it claims have "demonstrated promising results across various reasoning benchmarks", backed by their 'thinking' mechanisms such as long Chain-of-Thought (CoT) with self-reflection.

"Their emergence suggests a potential paradigm shift in how LLM systems approach complex reasoning and problem-solving tasks, with some researchers proposing them as significant steps toward more general artificial intelligence capabilities," the report authors wrote.

"Despite these claims and performance advancements, the fundamental benefits and limitations of LRMs remain insufficiently understood."

Apple tested LRMs and LLMs on a variety of puzzles, including the Tower of Hanoi, in which disks of different diameters are stacked on a rod in order of size to form a cone, and the player must rebuild the same stack on one of two other rods within certain rules. The more disks, the more complex the task.
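To give a sense of how quickly the Tower of Hanoi grows with each added disk, here is a minimal Python sketch of the classic recursive solution, assuming the standard rules (one disk moved at a time, and never a larger disk placed on a smaller one). It is for illustration only and is not taken from Apple's paper; the function names solve_hanoi and is_valid_solution and the rod labels "A", "B" and "C" are hypothetical.

```python
from typing import List, Tuple

Move = Tuple[str, str]  # (rod the disk leaves, rod the disk lands on)


def solve_hanoi(n: int, source: str = "A", target: str = "C",
                spare: str = "B") -> List[Move]:
    """Return the moves that transfer n disks from source to target."""
    if n == 0:
        return []
    # Park the top n-1 disks on the spare rod, move the largest disk,
    # then restack the n-1 disks on top of it.
    return (
        solve_hanoi(n - 1, source, spare, target)
        + [(source, target)]
        + solve_hanoi(n - 1, spare, target, source)
    )


def is_valid_solution(n: int, moves: List[Move]) -> bool:
    """Replay the moves and check that every one obeys the standard rules."""
    # Disks are numbered 1 (smallest) to n (largest); the end of each list is the top.
    rods = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not rods[src]:
            return False  # nothing to pick up
        disk = rods[src].pop()
        if rods[dst] and rods[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        rods[dst].append(disk)
    # Solved only if the full stack ends up on rod C.
    return rods["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    for disks in (3, 5, 10):
        moves = solve_hanoi(disks)
        print(disks, "disks:", len(moves), "moves, valid:",
              is_valid_solution(disks, moves))
```

The shortest solution for n disks takes 2^n - 1 moves, so three disks need 7 moves, five need 31 and ten need 1,023. Each extra disk roughly doubles the length of the sequence a solver has to produce without a single illegal move.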

Other tasks included the well-known river crossing puzzle, a one-dimensional puzzle of checker jumping, and a block-stacking puzzle.

The study found that LRMs tended to overthink simpler tasks, which LLMs performed more efficiently, but the more advanced models did start to outperform LLMs on tasks of medium difficulty.

But it was what happened when the tasks got too complex that has the tech world talking, as both LRMs and LLMs experienced "complete collapse" on high-complexity tasks. The experiments have cast doubt over the idea that AI can 'think', revealing its limited self-correction capabilities.

"Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalisable reasoning capabilities beyond certain complexity thresholds," the authors wrote.

"Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs.

"Our detailed analysis of reasoning traces further exposed complexity dependent reasoning patterns, from inefficient “overthinking” on simpler problems to complete failure on complex ones."

The researchers also found that LRMs have limitations in exact computation, failing to use explicit algorithms and reasoning inconsistently across puzzles.

They claim their findings challenge prevailing assumptions about LRM capabilities and suggest current approaches may be encountering fundamental barriers to generalisable reasoning.