🤖 AI Summary
The ARC Prize has officially announced ARC-AGI-2, a new benchmark designed to assess advanced AI reasoning systems. In a notable development, AI models such as GPT-5.2 and Gemini 3 Pro have surpassed the benchmark's human baseline, marking a significant milestone in AI performance. Interestingly, while earlier benchmarks reported human participants scoring close to 100%, the new scoring reflects a more nuanced standard: ARC-AGI-2 includes only tasks that at least two human participants were able to solve, making the framework more demanding for AI systems and humans alike. The scores nevertheless raise questions about how AI efficacy compares with human intelligence, especially given that average human performance is believed to be below 50%.
This advancement has critical implications for the AI/ML community: it underscores the rapid progression of AI capabilities, highlighting not only the technical achievements but also the cost-efficiency of AI systems. GPT-5.2 achieved a score of 52.1% at a cost of $1.90 per task, in stark contrast with the $5 paid to human participants. The results further illustrate an evolving landscape in which models are not only refining their reasoning abilities but also outperforming human counterparts in both efficiency and accuracy. As ARC Prize plans to release comprehensive human performance data alongside ARC-AGI-2, these benchmarks will serve as essential tools for ongoing research and development in the field.