🤖 AI Summary
Epoch AI’s independent evaluation of Google’s publicly available Gemini 2.5 “Deep Think” (a variant of the IMO gold model) tested the model on FrontierMath (350 short-answer problems across four difficulty tiers) and other research problems, supplemented by hands-on trials with two professional mathematicians. Deep Think set a new record on FrontierMath Tiers 1–3 (29%) and scored 10% on Tier 4, showing notable gains in leveraging background knowledge and executing precise, involved computations. It matched top models as a research assistant in many cases, solved a previously unsolved Tier 4 problem, and often took a more conceptual (non-coordinate) approach to geometry problems. The evaluation was conducted manually through the Gemini app (with code execution and web search), and Epoch notes that the public model is not the exact IMO gold model.
However, important weaknesses remain: the model struggles with genuinely creative or intricate proofs, makes classic human-like reasoning errors on simple word problems, and frequently fabricates or misattributes bibliographic citations, often citing non-existent or incorrect papers (a major caution for researchers). Correlations between model success and rated problem traits show that performance drops as demands for background knowledge (-0.22) and precision (-0.24) increase, while the correlation with creativity is weak (-0.09). Overall, Deep Think advances computational and literature-aware capabilities but needs better bibliographic hygiene and more creativity to be a fully reliable research collaborator.
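To make the correlation figures concrete, here is a minimal sketch of how a correlation between per-problem success and a rated problem trait might be computed. The data, variable names, and rating scale are hypothetical illustrations, not Epoch AI's actual dataset or methodology.

```python
# Hypothetical sketch: correlating per-problem success (0/1) with a rated
# problem trait (e.g., background-knowledge demand on a 1-5 scale).
# All values below are illustrative, not Epoch AI's data.
import numpy as np

solved = np.array([1, 0, 1, 1, 0, 0, 1, 0])        # 1 = model solved the problem
background = np.array([2, 5, 1, 2, 4, 5, 3, 4])     # hypothetical trait ratings

# The Pearson correlation of a binary outcome with a rating is the
# point-biserial correlation; np.corrcoef computes it directly.
r = np.corrcoef(solved, background)[0, 1]
print(f"correlation(success, background demand) = {r:.2f}")
# A negative value indicates lower success on problems rated higher for that trait.
```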