How fast can an LLM go? (fergusfinn.com)

🤖 AI Summary
A new deep dive uses a roofline-style analysis plus real InferenceMAX benchmarks to answer "How fast can an LLM go?" The core observation: transformer FLOPs are overwhelmingly dominated by matrix multiplies, so end-to-end performance is set by arithmetic intensity, the ratio of arithmetic work to memory traffic, relative to an accelerator's compute-to-bandwidth ratio. In practice this yields two regimes: prefill (processing the prompt and building the KV cache) is compute-bound on modern GPUs, while decode (one token per step) is usually memory-bandwidth bound because each step must stream the model weights and the growing KV cache. The article shows how the threshold token and batch sizes that flip a workload from bandwidth-bound to compute-bound depend on accelerator specs (e.g., H100 FP8 peak TFLOPS and TB/s of HBM bandwidth), and that attention FLOPs matter more during prefill than decode.

Benchmarks find real systems reach roughly 20–50% of theoretical peak; the gaps come from communication and synchronization, kernel and scheduling overheads, imperfect overlap of data transfer with compute, and extra memory movement introduced by optimizations like chunked prefill. Chunked prefill (heterogeneous batches that interleave prefill with ongoing decodes) recovers some spare compute and raises throughput. Further gains are possible via speculative decoding, disaggregated prefill (separate hardware for prefill vs. decode), MoE/sparse models (where active parameters become the figure of merit), and architectural changes (flash/linear attention and better low-level kernels). The piece argues that much of the remaining upside lies in software and system design rather than raw hardware capability.
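To make the compute-vs-bandwidth crossover concrete, here is a minimal back-of-envelope sketch of the roofline argument. The H100 figures are approximate published specs, and the model size (a dense 70B-parameter model with FP8 weights) is an illustrative assumption, not something taken from the article; the KV cache is ignored, which only makes decode more bandwidth-bound than shown.

```python
# Roofline sketch: when does batched decode cross from bandwidth-bound
# to compute-bound? Hardware numbers are approximate H100 SXM specs.

PEAK_FP8_TFLOPS = 1979.0     # dense FP8 peak, approx.
HBM_BANDWIDTH_TBPS = 3.35    # HBM3 bandwidth, approx.

# The accelerator's "ridge point": FLOPs per byte needed to be compute-bound.
ridge_flops_per_byte = (PEAK_FP8_TFLOPS * 1e12) / (HBM_BANDWIDTH_TBPS * 1e12)


def decode_arithmetic_intensity(n_params: float, batch_size: int,
                                bytes_per_weight: float = 1.0) -> float:
    """FLOPs per byte for one decode step of a dense model.

    Counts only the weight matmuls: one multiply-add per parameter per
    sequence, with every weight streamed from HBM once per step. The KV
    cache is ignored here (it would lower the intensity further).
    """
    flops = 2.0 * n_params * batch_size
    bytes_moved = n_params * bytes_per_weight
    return flops / bytes_moved


if __name__ == "__main__":
    n_params = 70e9  # illustrative dense 70B model, FP8 weights
    print(f"ridge point: {ridge_flops_per_byte:.0f} FLOPs/byte")
    for batch in (1, 32, 256, 1024):
        ai = decode_arithmetic_intensity(n_params, batch)
        regime = "compute-bound" if ai >= ridge_flops_per_byte else "bandwidth-bound"
        print(f"batch {batch:5d}: {ai:6.0f} FLOPs/byte -> {regime}")
```

Under these assumptions the ridge point is roughly 590 FLOPs/byte, while decode at batch size B delivers only about 2×B FLOPs per byte of weights read, so single-stream decode sits deep in the bandwidth-bound regime and only very large batches (or prefill, which processes many tokens per weight read) approach the compute roof.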