🤖 AI Summary
NVIDIA published an updated set of data-center inference benchmarks and a position paper arguing that inference metrics must go beyond raw latency to include throughput, energy efficiency and cost-per-token as AI shifts from one-shot answers to multi-step, token-heavy reasoning. Using MLPerf v5.1 closed-division results and broader internal tests, NVIDIA highlights Blackwell-era systems (GB300/GB200) alongside B200/H200 platforms, crediting tight hardware-software co-design and continuous software tuning with moving the Pareto frontier: depending on production priorities, deployments trade off cost, tokens/sec, tokens/watt and responsiveness.
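As a rough illustration of how those metrics relate (this is not NVIDIA's methodology; the throughput, power draw and hourly price below are invented placeholders), converting a measured token rate into tokens-per-watt and cost-per-million-tokens is simple arithmetic:

```python
# Illustrative sketch only: how throughput, power and instance price combine
# into the per-token metrics the post describes. All numbers are made-up
# placeholders, not benchmark results.

def tokens_per_watt(tokens_per_sec: float, system_watts: float) -> float:
    """Energy efficiency: output tokens per watt-second (i.e., per joule)."""
    return tokens_per_sec / system_watts

def cost_per_million_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Token economics: dollars to generate one million output tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

# Hypothetical system: 100k tokens/s aggregate, 10 kW draw, $300/hr rental.
print(tokens_per_watt(100_000, 10_000))          # -> 10.0 tokens per joule
print(cost_per_million_tokens(100_000, 300.0))   # -> ~$0.83 per 1M tokens
```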
Key technical takeaways: MLPerf offline and server scenarios show GB300 clusters delivering top aggregate throughput (e.g., DeepSeek R1 at ~420k tokens/s on 72× GB300; Llama3.1 405B at ~16k tokens/s on 72× GB300), while B200 setups excel on many practical models (Llama3.1 8B at ~147k tokens/s on 8× B200; Llama2 70B at up to ~103k tokens/s on 8× B200 offline). Other highlights include Whisper (up to ~45k samples/s on 8× B200), Stable Diffusion XL (~33 queries/s on 8× B200) and Qwen3 235B at ~66k output tokens/s using FP4 with TensorRT-LLM. Results span FP16/FP32/FP4 precisions, TensorRT stacks and MLPerf latency targets, underscoring that full-stack optimization and precision-aware inference deliver substantial gains in throughput and energy efficiency, and lower per-token costs, for large-scale AI deployments.
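To make the Pareto-frontier framing concrete, here is a minimal sketch (with invented candidate configurations, not measured results) of filtering deployment configs down to those that are not dominated on both per-user responsiveness and aggregate throughput:

```python
# Minimal Pareto-frontier sketch: keep only configurations for which no other
# config is at least as good on both axes. Candidates are invented examples.
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    tokens_per_sec_per_user: float   # responsiveness / interactivity
    tokens_per_sec_per_gpu: float    # aggregate throughput (drives cost)

def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return configs not dominated by any other config on both metrics."""
    frontier = []
    for c in configs:
        dominated = any(
            o != c
            and o.tokens_per_sec_per_user >= c.tokens_per_sec_per_user
            and o.tokens_per_sec_per_gpu >= c.tokens_per_sec_per_gpu
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return frontier

candidates = [
    Config("low-batch", 120.0, 1_500.0),   # fast per user, low aggregate
    Config("mid-batch", 60.0, 6_000.0),
    Config("high-batch", 15.0, 12_000.0),  # slow per user, high aggregate
    Config("bad-config", 40.0, 4_000.0),   # dominated by "mid-batch"
]
for c in pareto_frontier(candidates):
    print(c.name, c.tokens_per_sec_per_user, c.tokens_per_sec_per_gpu)
```

Batch size is the usual knob here: larger batches raise aggregate tokens/s per GPU (and lower cost-per-token) at the expense of per-user speed, which is why a single "best" configuration does not exist and the frontier matters.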