🤖 AI Summary
Epoch AI’s independent evaluation of Google’s publicly available Gemini 2.5 “Deep Think” (a variant of the IMO gold model) tested the model on FrontierMath (350 short-answer problems across four difficulty tiers) and other research problems, supplemented by hands-on trials with two professional mathematicians. Deep Think set a new record on FrontierMath Tiers 1–3 (29%) and scored 10% on Tier 4, showing notable gains in leveraging background knowledge and executing precise, involved computations. It matched top models as a research assistant in many cases, solved a previously unsolved Tier 4 problem, and often took a more conceptual (non-coordinate) approach to geometry problems. The evaluation was conducted manually through the Gemini app (with code execution and web search), and Epoch notes that the public model is not the exact IMO gold model.
However, important weaknesses remain: the model struggles with genuinely creative or intricate proofs, makes classic human-like reasoning errors on simple word problems, and frequently fabricates or misattributes bibliographic citations, often citing non-existent or incorrect papers (a major caution for researchers). Correlations between model success and rated problem traits show that performance drops as demands for background knowledge (-0.22) and precision (-0.24) increase, while the correlation with creativity is weak (-0.09). Overall, Deep Think advances computational and literature-aware capabilities but needs better bibliographic hygiene and more creativity to be a fully reliable research collaborator.
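To make the correlation figures concrete, here is a minimal sketch of how a correlation between per-problem success and a rated problem trait might be computed. The data, variable names, and rating scale are hypothetical illustrations, not Epoch AI's actual dataset or methodology.

```python
# Hypothetical sketch: correlating per-problem success (0/1) with a rated
# problem trait (e.g., background-knowledge demand on a 1-5 scale).
# All values below are illustrative, not Epoch AI's data.
import numpy as np

solved = np.array([1, 0, 1, 1, 0, 0, 1, 0])        # 1 = model solved the problem
background = np.array([2, 5, 1, 2, 4, 5, 3, 4])     # hypothetical trait ratings

# The Pearson correlation of a binary outcome with a rating is the
# point-biserial correlation; np.corrcoef computes it directly.
r = np.corrcoef(solved, background)[0, 1]
print(f"correlation(success, background demand) = {r:.2f}")
# A negative value indicates lower success on problems rated higher for that trait.
```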