🤖 AI Summary
Researchers from DeepMind and OpenAI (DeepRD), along with several follow-up studies, show that modern "reasoning" LLMs don't degrade gradually as tasks get harder; they hit a sharp complexity cliff. Models sustain high accuracy across a narrow complexity band (e.g., ~85% up to ~10–12 reasoning steps) and then collapse to near-random performance by ~15 steps. Benchmarks like MMLU hide this because they only probe a limited band. Related studies expose three linked failures: composition (AgentCoMa found that combining two capabilities that each succeed ~90% of the time can drop overall accuracy by ~30%), chain-of-thought (CoT) prompting, which can actively hurt domain tasks (86.3% of models did worse on medical diagnosis with CoT), and "gaslighting" attacks, which flip correct answers 25–53% of the time. Simply throwing compute at inference also fails economically: Quiet-STaR spent ~$200 and 12.7 trillion FLOPs per problem for 32% accuracy, with seconds-to-minutes runtimes comparable to a human's at far greater cost.
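To make the cliff concrete, here is a minimal sketch of the kind of evaluation sweep these findings imply: bucket problems by required reasoning steps and report accuracy per bucket, so a sharp drop stands out instead of being averaged away. The `model_answer` callable and the `(prompt, gold, n_steps)` problem format are illustrative assumptions, not from any of the cited papers.

```python
from collections import defaultdict

def accuracy_by_steps(problems, model_answer):
    """Return {n_steps: accuracy} for problems given as (prompt, gold, n_steps).

    `model_answer` is a hypothetical callable wrapping whatever model is under test.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for prompt, gold, n_steps in problems:
        total[n_steps] += 1
        if model_answer(prompt) == gold:
            correct[n_steps] += 1
    return {k: correct[k] / total[k] for k in sorted(total)}

# A complexity cliff shows up as a sharp transition in this table
# (e.g., ~0.85 through 10-12 steps, near chance by ~15), which a single
# benchmark-wide average would smooth over.
```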
The takeaway for practitioners and researchers is stark: current architectures appear to pattern-match up to a hard limit rather than reason compositionally. The promising fixes aren't scale-ups but new paradigms: meta-learning compositional primitives (a 5.7M-parameter meta-learned model matched 8B+-parameter baselines on some tasks), architectural changes (discrete bottlenecks, recursive latents), and domain-specific reasoning modules (MERRY, SWiRL) that yield 15–30% gains, with SWiRL reporting a ~21.5% improvement on math. Practically, teams should test at complexity boundaries, route problems to specialized modules, plan for sudden failure modes, and prioritize robustness and compositional designs over more inference compute.
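A hedged sketch of the routing recommendation, assuming a cheap step-count estimator is available; the 12-step threshold, the estimator, and the module names are illustrative assumptions, not taken from the studies.

```python
def route(problem, estimate_steps, general_llm, specialist):
    """Send a problem to the general model only while it is inside its reliable band.

    `estimate_steps`, `general_llm`, and `specialist` are hypothetical callables;
    the 12-step threshold mirrors the ~10-12 step band reported above.
    """
    if estimate_steps(problem) <= 12:
        return general_llm(problem)   # within the band where accuracy holds around ~85%
    return specialist(problem)        # hand off before the ~15-step collapse
```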