AI evals are becoming the new compute bottleneck (huggingface.co)

🤖 AI Summary
AI evaluation costs are escalating, creating significant barriers for developers in the AI/ML community. The Holistic Agent Leaderboard (HAL) recently spent roughly $40,000 executing 21,730 agent rollouts across nine benchmarks, underscoring the rising expense of both static and agent evaluations. Costs vary widely with factors such as scaffold choice, which can produce disparities of up to 200x in compute for similar tasks. The GAIA evaluation, for instance, can cost $2,829 per run, and analyses show that running identical benchmarks can mean spending substantially more for only marginal gains in accuracy.

The implications are significant: rising costs threaten to restrict effective evaluation to those with substantial resources. Static benchmarks have shown that large reductions are possible, cutting compute by up to 200x while preserving rankings, but those savings largely vanish for agent evaluations, which are inherently noisier and less compressible. The move from traditional evaluation to agent-based assessment also shifts the cost burden from training toward evaluation, reversing long-standing assumptions about the cost structure of machine learning. As the AI/ML community grapples with these challenges, developing cheaper and more reliable evaluation mechanisms will be crucial for broadening participation and sustaining innovation in the field.
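The headline figures above imply a per-rollout cost that a quick back-of-the-envelope calculation makes concrete. This sketch uses only the numbers quoted in the summary (the $40,000 HAL spend and 21,730 rollouts); applying the 200x disparity factor to the average is an illustrative assumption, not a figure from the source.

```python
# Back-of-the-envelope eval costs, using figures quoted in the summary.
total_cost_usd = 40_000   # reported HAL spend
rollouts = 21_730         # agent rollouts across nine benchmarks

avg_cost_per_rollout = total_cost_usd / rollouts
print(f"average cost per rollout: ${avg_cost_per_rollout:.2f}")

# Illustrative only: a 200x scaffold-driven disparity centered on that
# average spans a very wide per-rollout cost range.
cheap, expensive = avg_cost_per_rollout, avg_cost_per_rollout * 200
print(f"200x disparity range: ${cheap:.2f} to ${expensive:.2f}")
```

At roughly $1.84 per rollout on average, even modest benchmark suites at agent scale quickly reach five-figure bills, which is the access problem the article describes.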