🤖 AI Summary
The piece argues that 2025 evaluations should prioritize building “models people can use” — practical, reliable assistants — rather than chasing abstract notions of general intelligence. Citing Anthropic and OpenAI usage reports showing LLMs are most often used as assistants (coding, admin, agentic workflows), it proposes a multi-layered evaluation strategy: (1) test specific capabilities during development, (2) measure integrated performance on realistic tasks, and (3) probe adaptability in dynamic environments. The goal is to reward models that manage ambiguity, construct and execute stepwise plans, call tools correctly, handle long context, do math and code reliably, and avoid hallucination.
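The summary does not include the article's harness, but a minimal sketch of what a layer-(1) capability check might look like is shown below; the tool-call JSON shape, function names, and grading rule are assumptions for illustration only.

```python
# Sketch of a capability-level check: verify that a model's emitted tool call
# is well-formed, targets the expected function, and supplies the required
# arguments. Malformed output is graded as a failure.
import json

def grade_tool_call(raw_output: str, expected_name: str, required_args: dict) -> bool:
    """Return True if the output parses as a tool call matching the expected
    function name and containing the required arguments."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a failed call
    if call.get("name") != expected_name:
        return False
    args = call.get("arguments", {})
    # Every required argument must be present with the expected value.
    return all(args.get(k) == v for k, v in required_args.items())

# Example: checking a single weather-lookup call (hypothetical tool).
sample = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}'
print(grade_tool_call(sample, "get_weather", {"city": "Oslo"}))  # True
```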
Technically, the write-up emphasizes that these behaviors require combining reasoning, long-context memory management, low hallucination rates, tool use, and robustness to unexpected events — and that reasonably small models (≈7B) can already serve as effective agents, with a performance cliff observed below ≈3B. It warns against over-reliance on saturated or contaminated benchmarks like MMLU and recommends richer evaluation modalities (game-based tasks for planning and adaptation, forecasting for calibration, and integrated assistant workflows) that better capture real-world usefulness, reliability, and safety trade-offs at deployment.
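The article recommends forecasting as a calibration probe without specifying a metric; one common choice, sketched here as an assumption rather than the author's method, is the Brier score: the mean squared error between predicted probabilities and binary outcomes (lower is better; a constant 0.5 forecast scores 0.25).

```python
# Brier score: average squared gap between forecast probabilities and
# realized 0/1 outcomes. Well-calibrated, informative forecasters score low.
def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    assert len(probabilities) == len(outcomes) and len(outcomes) > 0
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Example: three forecasts against realized yes/no outcomes.
print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))  # 0.07
```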