LLMs Suck at Deep Thinking Part 3 (www.lesswrong.com)

🤖 AI Summary
Researcher Taylor Gordon Lunt ran a 90-game pilot tournament (six round-robins of six players) pitting humans, a Random baseline, and four successive OpenAI LLMs (GPT-3.5-Turbo, GPT-4o, GPT-4.1, GPT-5) against one another across three classic board games (9x9 Go, Chess, Connect-4) and three novel analogs (Fantastical Chess, Shove, Hugs & Kisses). Moves were time-limited (LLMs <10 min, humans <5 min), and a chain of illegal moves could forfeit a match. Outcomes were scored as win/draw points and converted into Elo-like ratings by fitting a Bradley–Terry model (choix, reg=0.01) and mapping the fitted strengths to 1500 + 400*BT.

Results: later models consistently beat earlier ones; GPT-5 achieved the highest overall score and a ~30% raw win rate (outperforming humans overall), while GPT-3.5-Turbo sometimes failed to beat even Random. Crucially, humans outperformed LLMs on the computationally complex games (branching factor ~35 for the chess variants), while LLMs dominated the simpler games (branching factor <7), and AIs showed no clear advantage on classic versus novel games.

The experiment provides suggestive evidence that recent LLM gains mostly improve “shallow” pattern-matching and heuristic processing rather than the heavy, search-like “deep thinking” needed to explore vast combinatorial spaces and form novel strategies. That would explain the smaller inter-model spread on the complex games and the frequent catastrophic blunders (e.g., giving away a queen) even in advanced models.

Implications for the ML community: current benchmarks may over-reward shallow competence, so claims of near-term AGI risk being overestimates unless architectures or training regimes enable sustained deliberative search, learning from experience, or other mechanisms that close the deep-thinking gap. The author notes this is a limited pilot and calls for larger, more rigorous benchmarks targeting deep deliberation.
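
For readers curious about the rating step, below is a minimal sketch of the Bradley–Terry → Elo-like conversion described above. The post only names the choix library, a regularization of 0.01, and the 1500 + 400*BT mapping; the specific routine (`choix.ilsr_pairwise`), the use of its `alpha` argument as that regularization, and the toy results list are assumptions for illustration.

```python
# Sketch of the Bradley-Terry rating step described in the summary.
# Assumptions: choix.ilsr_pairwise is the fitting routine (the post only
# names the choix library and reg=0.01), and the results below are made up.
import choix
import numpy as np

players = ["Random", "GPT-3.5-Turbo", "GPT-4o", "GPT-4.1", "GPT-5", "Human"]

# Each decisive game as (winner_index, loser_index). Draws could be handled
# by counting half a win in each direction; omitted here for brevity.
results = [
    (4, 0),  # GPT-5 beats Random
    (5, 1),  # Human beats GPT-3.5-Turbo
    (4, 5),  # GPT-5 beats Human
    (2, 1),  # GPT-4o beats GPT-4.1... no, beats GPT-3.5-Turbo
    (3, 2),  # GPT-4.1 beats GPT-4o
]

# Fit Bradley-Terry strengths; alpha is choix's regularization parameter,
# playing the role of the post's reg=0.01.
theta = choix.ilsr_pairwise(len(players), results, alpha=0.01)

# Map strengths onto an Elo-like scale: 1500 + 400 * theta.
ratings = 1500 + 400 * np.asarray(theta)

for name, rating in sorted(zip(players, ratings), key=lambda x: -x[1]):
    print(f"{name:15s} {rating:7.1f}")
```

With only a handful of games per pairing (as in a 90-game tournament), the regularization keeps the fitted strengths from diverging for players with lopsided records, which is presumably why a small reg value was used.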