🤖 AI Summary
METR has released Time Horizon 1.1 (TH1.1), a significant update to their task suite and evaluation infrastructure that grows the number of tasks from 170 to 228 (a 34% increase). The new version refreshes time horizon estimates for 14 AI models: most estimates remain within prior confidence intervals, but together they indicate a faster growth trend in capabilities. The expanded suite incorporates more complex problems, including more long-duration tasks, which tightens and refines the estimates METR uses to assess AI models' autonomous capabilities.
The shift to Inspect, a widely used open-source evaluation framework, is another pivotal change aimed at improving how AI capabilities are evaluated. This transition, combined with the expanded task suite, produced notable shifts in individual time horizon estimates, such as a 55% increase for GPT-5. In addition, the estimated post-2023 doubling time for model capabilities shortened from 165 days to 131 days under TH1.1, pointing to more rapid advancement. METR emphasizes that a well-defined task distribution is essential for accurate performance measurement, and says it will continue improving its evaluations of emerging AI models.
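To give a feel for what the shorter doubling time implies, the sketch below converts a doubling time into an annualized growth multiple. The 165- and 131-day figures come from the summary above; the per-year factors are derived arithmetic, not numbers from the METR report:

```python
import math

def annual_growth_factor(doubling_time_days: float) -> float:
    """Convert a capability doubling time (in days) into a per-year growth multiple."""
    return 2 ** (365.0 / doubling_time_days)

# A 165-day doubling time corresponds to roughly 4.6x growth per year;
# shortening it to 131 days raises that to roughly 6.9x per year.
for days in (165, 131):
    print(f"{days}-day doubling -> {annual_growth_factor(days):.1f}x per year")
```

In other words, moving from a 165-day to a 131-day doubling time raises the implied annual growth in measured time horizons from about 4.6x to about 6.9x.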