An LLM verifier rated math proofs near-perfect; an expert found 17% correct (korbonits.com)

0 points 1 hour ago

🤖 AI Summary

A recent study from MiniMax titled MaxProof reveals a stark gap between how confidently AI-generated math proofs are scored and how often they are actually correct. The model, M3, scored 35 out of 42 on the International Mathematical Olympiad (IMO) 2025 and 36 out of 42 on the USAMO 2026. However, an independent expert review found that only 17% of these high-scoring proofs were correct. Most troubling, as the paper itself highlights, is a case where an LLM verifier scored proofs a near-perfect 0.99 while a human judge rated them just 0.55 on average. The finding matters for the AI/ML community because it shows how hard it is to tell convincing output from correct output: a proof can read as polished and still be fundamentally flawed. The MiniMax team acknowledges that their verifier remains limited even after they added multiple defensive layers to reduce false positives, which underscores the need for human oversight when validating these proofs. More broadly, the analysis points to a persistent tension in AI development: reliable output depends on verification, and verification is costly and complex.
