🤖 AI Summary
At the AI Infrastructure Summit, NVIDIA showcased major gains in AI inference performance with its new Blackwell Ultra GB300 GPUs, setting new MLPerf benchmark records across the board. The leap comes from both hardware innovation and new software strategies: extensive use of the NVFP4 numerical format, which balances precision against much lower compute and memory demands, plus parallelism techniques such as "expert parallelism" for Mixture of Experts models and "data parallelism" for attention, tuned with an "Attention Data Parallelism Balance" method. NVIDIA further boosted throughput with "Disaggregated Serving," which splits inference between GPUs specialized for the compute-heavy "Context/Prefill" phase and GPUs handling the memory-bound "Decode/Generation" phase, yielding a 1.5x per-GPU throughput gain and 5.4x the overall performance of the previous Hopper systems.
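To make the prefill/decode split concrete, here is a minimal, purely illustrative Python sketch of disaggregated serving: one pool of workers handles the compute-heavy context/prefill pass and hands its KV cache to a separate pool that runs the memory-bound decode loop. All names here (`PrefillWorker`, `DecodeWorker`, `KVCache`, `make_scheduler`) are hypothetical and the "model" is a placeholder; a real system would also manage KV-cache transfer between GPUs, batching, and load balancing.

```python
import itertools
from dataclasses import dataclass


@dataclass
class KVCache:
    """Stands in for the key/value cache produced by the prefill phase."""
    tokens: list


@dataclass
class PrefillWorker:
    """Compute-bound stage: processes the entire prompt once."""
    name: str

    def prefill(self, prompt_tokens):
        # In a real system this is a large, matmul-heavy parallel pass.
        return KVCache(tokens=list(prompt_tokens))


@dataclass
class DecodeWorker:
    """Memory-bandwidth-bound stage: generates one token at a time."""
    name: str

    def decode(self, cache, max_new_tokens):
        out = []
        for _ in range(max_new_tokens):
            # Placeholder "model": the next token depends only on the cache.
            next_token = (sum(cache.tokens) + len(out)) % 50_000
            cache.tokens.append(next_token)
            out.append(next_token)
        return out


def make_scheduler(prefill_pool, decode_pool):
    """Round-robin routing stands in for a real load balancer."""
    prefill_cycle = itertools.cycle(prefill_pool)
    decode_cycle = itertools.cycle(decode_pool)

    def serve(prompt_tokens, max_new_tokens=8):
        cache = next(prefill_cycle).prefill(prompt_tokens)       # compute-heavy phase
        return next(decode_cycle).decode(cache, max_new_tokens)  # memory-bound phase

    return serve


if __name__ == "__main__":
    prefill_pool = [PrefillWorker(f"ctx-gpu-{i}") for i in range(2)]
    decode_pool = [DecodeWorker(f"gen-gpu-{i}") for i in range(4)]
    serve = make_scheduler(prefill_pool, decode_pool)
    print(serve([1, 2, 3, 4]))
```

Because the two phases stress different resources, the pools can be sized independently, which is the intuition behind the per-GPU throughput gain described above.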
Building on these insights, NVIDIA unveiled the Rubin CPX GPU, designed specifically for massive-context inference and optimized for processing very large contextual inputs efficiently. Unlike the standard Rubin GPU, which emphasizes memory bandwidth with HBM3e, Rubin CPX uses GDDR7 memory and integrates video encoding for generative video AI. It delivers up to 30 petaFLOPS of NVFP4 tensor compute, with exponent operations three times faster than Blackwell Ultra. NVIDIA also revealed hardware configurations that pair Rubin and Rubin CPX GPUs in hybrid racks, such as the Vera Rubin NVL144 CPX and a dual-rack solution, pushing NVFP4 compute past 8 exaFLOPS and memory capacity to 150 TB for next-generation AI workloads. These systems are expected to be available by late next year, marking a major step forward for large-context AI inference and generative model deployment.