HLE-Verified: A Verification and Revision of Humanity's Last Exam (arxiv.org)

🤖 AI Summary
HLE-Verified is a refined version of the benchmark Humanity's Last Exam (HLE), addressing concerns in the AI/ML community about how accurately large language models are evaluated. Previous analyses found that HLE contained numerous noisy items that could skew evaluation results. HLE-Verified applies a rigorous two-stage validation and revision process, yielding 641 verified items and 1,170 revised-and-certified items, while flagging 689 items as uncertain for future refinement. On the revised benchmark, seven state-of-the-art models score an average of 7-10 percentage points higher, with gains of 30-40 percentage points on items previously marred by errors. Model confidence also tracks correctness more closely on the corrected items, suggesting the process genuinely reduces annotation noise rather than merely inflating scores, and so supports a more accurate assessment of model capabilities. By documenting its validation and revision steps, HLE-Verified improves transparency and reliability in benchmarking and offers a template for refining future benchmarks in the rapidly evolving field of AI and ML.