A new study from Apple Inc. AAPL is stirring debate about whether AI models can genuinely reason or simply mimic intelligent behavior. By testing reasoning systems such as OpenAI's o-series models and Anthropic's Claude on classic logic puzzles, the research suggests that these tools may stumble when real problem-solving is required.
What Happened: Apple released a study challenging the notion that large language models (LLMs) can logically reason through complex tasks. Ars Technica explains that by testing popular models like OpenAI's o1 and o3, Claude 3.7 Sonnet, and DeepSeek-R1 on classic logic puzzles such as Tower of Hanoi and river crossing tasks, the research team discovered that these systems often fail when they encounter unfamiliar challenges that demand systematic thinking.
Even when handed an established solution algorithm, the models struggled, highlighting a key gap between performing intelligently and actually thinking logically.
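For readers unfamiliar with the puzzle, the established solution is a short recursive procedure, and an n-disk game takes 2^n - 1 moves. The sketch below is purely illustrative Python, not the researchers' actual prompt or test harness, but it shows the kind of algorithm the models were given:

```python
# Classic recursive Tower of Hanoi solution -- illustrative only,
# not the prompt or harness used in the Apple study.
def hanoi(n, source, target, spare, moves):
    """Append the moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks out of the way
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top of it

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves (2**3 - 1), starting ('A','C'), ('A','B'), ('C','B'), ...
```

Because the move count doubles with each added disk, the researchers could dial up complexity simply by adding disks, which is where the models' reasoning effort reportedly collapsed.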
"It is truly embarrassing that LLMs cannot reliably solve Hanoi," said AI researcher Gary Marcus, with co-lead Iman Mirzadeh adding the models' behavior shows "their process is not logical and intelligent."
The study also found that while some models performed better on moderately difficult tasks by implementing step-by-step reasoning, they failed completely as complexity increased, often reducing their reasoning effort instead of expanding it.
This odd drop-off in effort, despite ample computing resources, reflects what the researchers call a "counterintuitive scaling limit." Performance was also inconsistent across the different puzzles, suggesting the failures are tied to specific task types rather than to a single underlying technical cause.
Why It Matters: Some experts are countering Apple's conclusions, arguing that the apparent reasoning failures in AI models may originate from built-in constraints rather than inherent flaws.
Pierre Ferragu, an analyst at New Street Research, said that the paper is riddled with "ontological nonsense."
Economist Kevin A. Bryan suggested that these systems are trained to take shortcuts under tight computational budgets. He and others note that internal benchmarks show models perform better when allowed more tokens, but production systems deliberately cap token use for efficiency. On that view, Apple's findings may be uncovering limits imposed by design, not by nature.
Others, like software engineer Sean Goedecke and AI researcher Simon Willison, question whether logic puzzles are even fair tests for language models. Goedecke described DeepSeek-R1's failure on the Tower of Hanoi as a deliberate choice to avoid generating impractically long output, not a lack of ability.
Willison added that the test may simply run into token limits, hinting that the paper is more sensational than conclusive. Even Apple's researchers admit the puzzles represent a narrow slice of reasoning challenges and caution against generalizing their results too widely.
The study comes on the heels of the Worldwide Developers Conference (WWDC), where Apple announced a range of product updates. Analysts noted the lack of major new AI features and expressed disappointment, with some lowering their ratings on the stock. Shares fell after the event, raising questions about Apple's AI future.
Price Action: Apple stock is trading at $198.76, down 0.01% in premarket trading.
Benzinga Edge Rankings show Momentum at 29.72, Value at 9.02 and Growth at 32.90, with Quality leading the group at 76.94.
© 2025 Benzinga.com. Benzinga does not provide investment advice. All rights reserved.