ZenMux-Benchmark, a dynamic AI model evaluation leaderboard (zenmux.ai)

🤖 AI Summary
ZenMux announced ZenMux-Benchmark, a dynamic, open-source leaderboard that systematically evaluates AI models across every provider channel available on the ZenMux platform. Unlike single-endpoint tests, each model is tested per provider channel (e.g., GPT-5 via OpenAI vs. Azure) to surface provider-specific differences in performance and stability. All test code, procedures, and raw results are public on GitHub, enabling reproducibility and community scrutiny. The project runs full-scale testing for each model to produce timely, comprehensive performance snapshots and aims to maintain a real-time, continuously updated leaderboard for model selection.

Technically, ZenMux-Benchmark uses Scale AI’s Humanity’s Last Exam (Text Only) as its primary evaluation dataset, a widely recognized benchmark covering broad knowledge domains and reasoning tasks. Because some vendors’ content filters or constraints can prevent models from answering every question, the benchmark computes scores using the number of questions a model actually responded to as the denominator and applies cost normalization proportional to completion rate to keep cost-effectiveness comparisons fair. Detailed success rates and per-provider completion statuses are published with each release. The project invites community feedback to refine methodology, expand evaluation dimensions, and improve the leaderboard’s utility for researchers and practitioners.
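The summary does not spell out the exact formulas, so the sketch below is one plausible reading of the scoring described above: accuracy is computed only over the questions a model actually answered, and observed spend is scaled up by the inverse of the completion rate so partially completed runs are not rewarded on cost-effectiveness. The function name, data shapes, and `observed_cost_usd` parameter are illustrative assumptions, not ZenMux's published code.

```python
# Hypothetical sketch of the per-channel scoring described in the summary.
# ZenMux has not published this exact formula; names and shapes are illustrative.

def channel_metrics(results, observed_cost_usd):
    """results: list of dicts like {"answered": bool, "correct": bool}
    for one model evaluated through one provider channel.
    observed_cost_usd: total spend recorded for the questions that ran."""
    total = len(results)
    answered = [r for r in results if r["answered"]]
    if not answered:
        return None  # channel produced no usable responses

    completion_rate = len(answered) / total

    # Accuracy uses the number of answered questions as the denominator,
    # per the methodology described above.
    accuracy = sum(r["correct"] for r in answered) / len(answered)

    # One reading of "cost normalization proportional to completion rate":
    # scale observed spend up to an estimated full-run cost so channels
    # that skipped questions are not unfairly cheap.
    normalized_cost = observed_cost_usd / completion_rate

    return {
        "completion_rate": completion_rate,
        "accuracy": accuracy,
        "normalized_cost_usd": normalized_cost,
    }
```

Under this reading, a channel that answers 80% of questions at $4 of spend is compared as if a full run cost roughly $5, while its accuracy reflects only the questions it actually attempted.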