🤖 AI Summary
Researchers propose a simple, actionable metric for agent ability: the length, measured in human time, of tasks an AI can autonomously complete. They measure success probability across a diverse set of multi-step software and reasoning tasks, fitting logistic curves that predict model success from the time human experts take to finish the same tasks. Today’s top models (e.g., Claude 3.7 Sonnet) nearly always succeed on tasks that take humans under ~4 minutes but fail on most tasks that take humans more than ~4 hours. The task length at which a model’s success curve crosses a fixed probability (e.g., 50%) then yields a single “task-length capability” number per model.
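As a rough illustration of that curve-fitting step, the sketch below fits a logistic regression of success on log task length and reads off the 50% crossing point. The data values, the choice of scikit-learn, and all variable names are illustrative assumptions, not the paper’s actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data (not the paper's): human completion time in minutes for
# each task, and whether the agent completed it autonomously (1) or not (0).
human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Logistic regression of success on log(task length), as the summary describes.
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)

# The 50% "time horizon" is where the fitted curve crosses P(success) = 0.5,
# i.e. where the linear predictor w * log(t) + b equals zero.
w, b = clf.coef_[0][0], clf.intercept_[0]
t50 = np.exp(-b / w)
print(f"50% time horizon ≈ {t50:.0f} human-minutes")
```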
Across six years of data, this metric grows exponentially: the length of task a state-of-the-art model can complete doubles roughly every seven months (sensitivity analyses span 1–4 doublings per year, and an alternate dataset suggests even faster doubling). If the trend continues, generalist agents could autonomously handle week- to month-long projects within this decade. The approach links benchmark performance to real-world impact, improving forecasts and risk assessment, though the authors flag methodological sensitivities (task selection, human time estimates) and report robustness checks. They’ve open-sourced the data and code to encourage replication and refinement.
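To make the doubling-time claim concrete, here is a minimal extrapolation sketch. The starting horizon of roughly one hour and the fixed seven-month doubling period are assumptions for illustration, not the paper’s fitted trend.

```python
# Minimal extrapolation sketch assuming a fixed ~7-month doubling time.
DOUBLING_MONTHS = 7.0
horizon_now_minutes = 60.0  # assumed: roughly a one-hour time horizon today

def projected_horizon(months_ahead: float) -> float:
    """Projected 50% time horizon after `months_ahead` months of exponential growth."""
    return horizon_now_minutes * 2 ** (months_ahead / DOUBLING_MONTHS)

for years in range(1, 6):
    h = projected_horizon(12 * years)
    print(f"+{years} year(s): ~{h / 60:.1f} human-hours "
          f"(~{h / (60 * 40):.1f} 40-hour work-weeks)")
```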