🤖 AI Summary
Researchers presented a hardware-agnostic limit study of transformer LLM inference that isolates the core system bottlenecks: memory bandwidth, memory capacity, and synchronization overhead during distributed auto-regressive decoding. Using a performance model that spans HBM3→HBM4, advanced 3D-stacked DRAM, SRAM designs, and scaling from multi-chip clusters to wafer-scale, they show that serving a single model instance typically needs hundreds of GB of server memory, and that high memory bandwidth is essential for high per-user throughput. Crucially, collective communication latencies must be on the order of ~1 µs or lower; otherwise synchronization stalls negate the benefit of the available bandwidth.
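To make the bandwidth/latency trade-off concrete, here is a minimal back-of-envelope sketch (not taken from the paper) of a memory-bound decode model: per-token time is the parameter and KV-cache bytes streamed over the aggregate memory bandwidth, plus a synchronization penalty for each layer's collectives. The function name `decode_tokens_per_sec` and all numbers (a ~70 GB FP8 weight footprint, 20 GB KV cache, 64 TB/s aggregate bandwidth, 80 layers, two all-reduces per layer) are illustrative assumptions, not the paper's values.

```python
# Minimal sketch of a decode-latency limit model; all inputs are illustrative assumptions.

def decode_tokens_per_sec(weights_gb: float,
                          kv_cache_gb: float,
                          agg_bandwidth_gbs: float,
                          n_layers: int,
                          collectives_per_layer: int,
                          collective_latency_us: float) -> float:
    """Per-user decode rate when each token must stream all weight and KV-cache
    bytes from memory and synchronize across chips after every collective."""
    # Memory-bound time per token: every parameter and KV-cache byte read once.
    t_mem = (weights_gb + kv_cache_gb) / agg_bandwidth_gbs            # seconds
    # Synchronization time per token: one latency hit per collective per layer.
    t_sync = n_layers * collectives_per_layer * collective_latency_us * 1e-6
    return 1.0 / (t_mem + t_sync)

# Illustrative 70B-class model in FP8 (~70 GB weights), 20 GB KV cache,
# 16 accelerators x 4 TB/s HBM = 64 TB/s aggregate, 80 layers, 2 all-reduces/layer.
for lat_us in (0.5, 1.0, 5.0, 20.0):
    rate = decode_tokens_per_sec(70, 20, 64_000, 80, 2, lat_us)
    print(f"collective latency {lat_us:5.1f} us -> {rate:7.1f} tokens/s per user")
```

Even with these rough numbers, pushing the collective latency from ~1 µs to tens of µs cuts the per-user rate severalfold, which is the qualitative point the paper makes about synchronization stalls.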
The paper also finds that DRAM-based designs are fundamentally more efficient in throughput per dollar and per watt than many SRAM-centric alternatives, and that current and near-future hardware can readily hit 2,000+ tokens/sec per user, but reaching 10,000+ tokens/sec will require smaller models, shorter context windows, or algorithmic breakthroughs (e.g., sparsity, batching/parallel decoding, or new decoding algorithms). These quantified trade-offs give practitioners concrete levers for deployment (add memory capacity, invest in bandwidth and low-latency interconnects) and point hardware designers toward the most impactful directions for scaling LLM inference.
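To see why 10,000+ tokens/sec per user pushes toward smaller models or sparsity, a rough arithmetic sketch (again with assumed numbers, not the paper's) of the aggregate bandwidth a purely memory-bound decoder would need, ignoring synchronization entirely:

```python
# Back-of-envelope sketch: aggregate memory bandwidth required for a target per-user
# decode rate when every token streams all model + KV-cache bytes.
# The per-token byte counts below are illustrative assumptions.

def required_bandwidth_tbs(target_tok_s: float, bytes_per_token_gb: float) -> float:
    """Aggregate bandwidth (TB/s) needed so that reading bytes_per_token_gb per
    generated token sustains target_tok_s tokens/sec for a single user."""
    return target_tok_s * bytes_per_token_gb / 1000.0   # GB/s -> TB/s

for model_gb in (90, 35, 10):          # full-size, smaller, aggressively sparse (assumed)
    for target in (2_000, 10_000):
        bw = required_bandwidth_tbs(target, model_gb)
        print(f"{model_gb:3d} GB/token @ {target:6d} tok/s -> {bw:6.1f} TB/s aggregate")
```

Under these assumptions a ~90 GB-per-token footprint at 10,000 tokens/sec would demand on the order of 900 TB/s of aggregate bandwidth, which is why shrinking the bytes touched per token (smaller models, shorter contexts, sparsity) is the lever that matters at that target.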