AI is bad at math, ORCA shows (www.theregister.com)

🤖 AI Summary
Researchers from Omni Calculator and several European universities have released ORCA (Omni Research on Calculation in AI), a 500-prompt benchmark designed to test real-world computational reasoning across seven domains: Biology & Chemistry, Engineering & Construction, Finance & Economics, Health & Sports, Math & Conversions, Physics, and Statistics & Probability. In October 2025 they evaluated five leading LLMs (Gemini 2.5 Flash, Grok 4, DeepSeek V3.2, ChatGPT-5, and Claude Sonnet 4.5) and found accuracies between 45% and 63%, with Gemini 2.5 Flash strongest at 63% and Claude Sonnet 4.5 weakest at 45%. Errors clustered in rounding (35%) and raw calculation mistakes (33%).

ORCA's authors argue that many existing math benchmarks are contaminated by training-data leakage, so ORCA aims to measure genuine computational ability rather than memorized patterns. A cited Engineering prompt illustrates the failure mode: models sometimes hedge between total and per-LED current, outputting both an incorrect and a correct answer.

For the AI/ML community the results are a sober reminder that strong natural-language reasoning performance does not guarantee deterministic arithmetic reliability. ORCA also exposes domain unevenness (e.g., DeepSeek scores 74% on conversions but 10.5% in Biology & Chemistry) and highlights the need for rigorous, leakage-resistant benchmarks and better integration of exact calculators, symbolic solvers, or tool use in LLM pipelines. The findings are a snapshot, since models evolve, but they underscore ongoing gaps in trustworthy computational reasoning for scientific and engineering use cases.
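The total vs per-LED confusion can be made concrete with a small hypothetical circuit. The numbers below are invented for illustration; they are not taken from the actual ORCA prompt:

```python
# Hypothetical parallel LED string; figures are illustrative only,
# not from the ORCA benchmark prompt.
NUM_LEDS = 20
CURRENT_PER_LED_MA = 20.0  # each LED draws 20 mA

# Total supply current is the sum over all parallel branches.
total_current_ma = NUM_LEDS * CURRENT_PER_LED_MA  # 400 mA

# A model that "hedges" may report the per-LED figure when the question
# asks for the total (or vice versa); only one value answers the question.
print(f"per-LED: {CURRENT_PER_LED_MA} mA, total: {total_current_ma} mA")
```

The two quantities differ by a factor of the LED count, so conflating them is a large, easily graded error rather than a rounding slip.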
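The tool-use remedy the authors point toward can be sketched as delegating arithmetic to an exact library instead of to token prediction. This is a generic illustration using Python's standard `fractions` and `decimal` modules, not a description of any specific ORCA or vendor pipeline:

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction

def exact_divide(numerator: int, denominator: int, places: int = 2) -> Decimal:
    """Exact rational division, rounded only once at the end.

    Keeping the value as a Fraction until the final step avoids both
    intermediate calculation drift and premature rounding, the two error
    classes ORCA flags most often (33% calculation, 35% rounding).
    """
    exact = Fraction(numerator, denominator)
    quantum = Decimal(1).scaleb(-places)  # e.g. Decimal('0.01') for places=2
    return (Decimal(exact.numerator) / Decimal(exact.denominator)).quantize(
        quantum, rounding=ROUND_HALF_UP)

# An LLM pipeline would emit a structured tool call such as
# exact_divide(1, 3, 4) rather than generating the digits itself.
print(exact_divide(1, 3, 4))  # 0.3333
```

The design point is that the model only has to decide *which* computation to run; the deterministic tool guarantees the digits, which is exactly the reliability the benchmark shows LLMs lacking on their own.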