MathArena Apex: Unconquered Final-Answer Problems (matharena.ai)

🤖 AI Summary
MathArena announces the Apex set: 12 of the hardest 2025 final-answer math problems that remain unsolved by state-of-the-art LLMs under a practical "few attempts" metric. The move comes after LLMs (GPT-5 among them) began scoring ~90% on standard final-answer competitions, cracking formerly stubborn problems (e.g., AIME 2025 P15).

To preserve a meaningful benchmark of model reasoning, MathArena filtered nearly 100 competitions, converting proof-based questions to final-answer format where possible and discarding any problem solved within 4 attempts by representative frontier models. The result: only 12 problems survived. Under expanded testing (9 models × 16 attempts), some problems yielded occasional successes, but Problems 9–12 had zero successes. The best single model on the set, Qwen3, reached just 5.2% overall accuracy, and majority voting would not reliably recover correct answers.

Technically, the organizers used a two-perspective solvability definition (pass@k for small k versus large k), evaluated models including Grok 4, GPT-5 (High, with an iterative self-verification "Agent" elicitation), Gemini 2.5 Pro, and GLM 4.5, and carefully considered contamination from publicly available contest data. Their analysis reveals common failure modes: models often converge on the same wrong guesses, exhibit overconfidence and weak uncertainty quantification, produce hand-wavy pseudoproofs, and struggle to refine constructions. MathArena proposes three directions forward (manual proof evaluation, Project Euler–style coding+math tasks, or aggregating the hardest final-answer problems) to better probe current limits and encourage community contributions.
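To make the evaluation concepts concrete, here is a minimal Python sketch of the standard unbiased pass@k estimator (Chen et al., 2021) and a simple majority-vote check over repeated attempts. This illustrates the metrics mentioned above under assumed conditions; it is not MathArena's evaluation code, and the attempt data and function names are hypothetical.

```python
"""Illustrative sketch: pass@k estimation and majority voting over attempts."""
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n attempts of which c are correct."""
    if n - c < k:
        return 1.0
    # 1 - probability that a random size-k subset of the n attempts has no correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer across attempts (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical example: 16 attempts on one problem, only 2 of which hit the true answer.
    attempts = ["42"] * 2 + ["17"] * 9 + ["256"] * 5
    truth = "42"
    n, c = len(attempts), attempts.count(truth)

    print(f"pass@4  ≈ {pass_at_k(n, c, 4):.3f}")   # strict, few-attempts view
    print(f"pass@16 ≈ {pass_at_k(n, c, 16):.3f}")  # lenient, any-success view
    # Majority voting fails here because the wrong answers cluster on the same guess,
    # mirroring the convergence-on-wrong-guesses failure mode described above.
    print("majority vote:", majority_vote(attempts), "| correct:", truth)
```

In this toy run the lenient pass@16 view counts the problem as solved while the strict pass@4 view gives less than even odds, and majority voting picks the clustered wrong answer, which is the gap the two-perspective solvability definition is meant to expose.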