InferenceMAX: Open-Source Inference Benchmarking (newsletter.semianalysis.com)

🤖 AI Summary
InferenceMAX™ is an open-source, automated inference benchmark that runs nightly across hundreds of accelerators to provide a live, continuously updated view of real-world LLM inference performance. It measures token throughput per GPU, per-user latency (tok/s/user), performance per dollar, cost per million tokens, and tokens per provisioned megawatt. Current coverage includes models such as DeepSeek R1 670B, GPT-OSS 120B, and Llama 3 70B on hardware spanning NVIDIA GB200 NVL72, B200, and H100 as well as AMD MI355X, MI325X, and MI300X.

A public dashboard (inferencemax.ai) exposes head-to-head results across software stacks such as SGLang, vLLM, and TensorRT-LLM on both CUDA and ROCm, capturing the rapid software-driven gains (kernel-level optimizations, FP4, MTP, speculative decoding, wide-EP, and distributed serving and scheduling strategies) that quickly make fixed-point-in-time benchmarks stale.

For the AI/ML community this matters because inference performance is shaped as much by fast-evolving software as by new silicon; nightly re-benchmarking provides a neutral, reproducible feedback loop for researchers, cloud operators, and buyers to assess throughput, TCO, and energy efficiency in near real time. InferenceMAX's vendor-neutral design highlights workload-specific tradeoffs (areas where AMD leads versus where NVIDIA leads) and is slated to expand to Google TPU and AWS Trainium, which would make it the first multi-vendor open benchmark tracking how low-level optimizations translate into tangible inference gains at scale.
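The economics metrics above follow directly from throughput and hardware cost. As a minimal sketch (the exact formulas InferenceMAX uses are not specified here, and the function name and the sample prices/throughputs below are illustrative assumptions, not benchmark results), cost per million tokens can be derived from a GPU's hourly price and its sustained tokens-per-second:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec_per_gpu: float) -> float:
    """Dollars to generate one million tokens on a single GPU.

    Assumes steady-state throughput; real serving includes warm-up,
    batching effects, and idle time that raise the effective cost.
    """
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Illustrative numbers only (not InferenceMAX data):
# a $2.00/hr GPU sustaining 1,000 tok/s costs about $0.56 per million tokens.
print(round(cost_per_million_tokens(2.00, 1000), 2))  # → 0.56
```

This is why software-driven throughput gains show up directly as performance-per-dollar gains on the same silicon: doubling tok/s on a fixed-price GPU halves the cost per million tokens.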