AI limitations: Apple research finds “complete accuracy collapse” for LRMs with complex problems

LRMs struggled with more complex iterations of the Tower of Hanoi puzzle. Photo: House of Marbles.

The limitations of more sophisticated artificial intelligence (AI) have been laid bare by a recent study from Apple (NASDAQ: AAPL), which found the performance of Large Reasoning Models (LRMs) from industry leaders like OpenAI, Anthropic, Google and DeepSeek falls to zero on tasks beyond a certain level of complexity.

While it is well known that simpler Large Language Models (LLMs) are non-thinking and pattern replicating (and prone to hallucinate without sufficient safeguards), researchers from Apple have sought to assess the reasoning capability of more advanced LRMs.

In their paper The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, Apple researchers investigated whether LRMs are capable of generalisable reasoning, or if they too are just leveraging different forms of pattern matching.

Apple looked at models such as OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking and Gemini Thinking, which it claims have "demonstrated promising results across various reasoning benchmarks", backed by their 'thinking' mechanisms such as long Chain-of-Thought (CoT) with self-reflection.

"Their emergence suggests a potential paradigm shift in how LLM systems approach complex reasoning and problem-solving tasks, with some researchers proposing them as significant steps toward more general artificial intelligence capabilities," the report authors wrote.

"Despite these claims and performance advancements, the fundamental benefits and limitations of LRMs remain insufficiently understood."

Apple tested LRMs and LLMs on a variety of tasks, including puzzles such as the Tower of Hanoi, in which a series of disks of different diameters are stacked on a rod in a conical shape, and the player must move the disks to the same formation on one of two other rods, moving one disk at a time and never placing a larger disk on top of a smaller one. The more disks, the more complex the task.
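To give a sense of how quickly the puzzle grows, here is a minimal Python sketch of the textbook recursive solution. It is an illustration only, not the evaluation code used in Apple's study; the function names `hanoi_moves` and `is_valid_solution` and the rod labels "A", "B" and "C" are assumptions made for this example.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the (disk, from_rod, to_rod) moves that solve an n-disk puzzle."""
    if n == 0:
        return []
    # Park the n-1 smaller disks on the spare rod, move the largest disk,
    # then bring the smaller disks back on top of it.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(n, source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


def is_valid_solution(n, moves, source="A", target="C", spare="B"):
    """Replay the moves and check every rule of the puzzle is respected."""
    rods = {source: list(range(n, 0, -1)), target: [], spare: []}
    for disk, src, dst in moves:
        # Only the top disk of a rod may be moved.
        if not rods[src] or rods[src][-1] != disk:
            return False
        # A larger disk may never sit on a smaller one.
        if rods[dst] and rods[dst][-1] < disk:
            return False
        rods[dst].append(rods[src].pop())
    # Solved only if the full stack ends up on the target rod.
    return rods[target] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        moves = hanoi_moves(n)
        print(n, "disks:", len(moves), "moves, valid:", is_valid_solution(n, moves))
```

The shortest solution for n disks takes 2^n - 1 moves, so each extra disk roughly doubles the length of the answer a model has to produce without a single error: 7 moves for three disks, but 1,023 for ten.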

Other tasks included the well-known river crossing puzzle, a one-dimensional puzzle of checker jumping, and a block-stacking puzzle.

The study found that LRMs tended to overthink simpler tasks, which standard LLMs performed more efficiently, but the more advanced models did start to outperform LLMs on tasks of medium difficulty.

But it was what happened when the tasks got too complex that has the tech world talking, as both LRMs and LLMs experienced "complete collapse" on high-complexity tasks. The experiments have cast doubt on the idea that AI can 'think', revealing its limited self-correction capabilities.

"Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalisable reasoning capabilities beyond certain complexity thresholds," the authors wrote.

"Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs.

"Our detailed analysis of reasoning traces further exposed complexity dependent reasoning patterns, from inefficient “overthinking” on simpler problems to complete failure on complex ones."

The researchers also found that LRMs have limitations in exact computation, failing to use explicit algorithms and reasoning inconsistently across puzzles.

They claim their findings challenge prevailing assumptions about LRM capabilities and suggest current approaches may be encountering fundamental barriers to generalisable reasoning.
