Model Flop Utilization Beyond 6ND (jott.live)

🤖 AI Summary
Recent analysis argues that the industry's go-to model-flop-utilization (MFU) heuristic, often summarized as inference F_i = 2·N·D and training F_t = 3·F_i (the "6ND" rule), is increasingly misleading for modern workloads. The piece walks through why the three core assumptions behind 6ND (one multiply-accumulate per parameter per token, backward pass = 2x the forward pass, and compute-bound execution) break down: attention cost grows nonlinearly (O(D^2), or hybrid local patterns) rather than scaling linearly with tokens; FFNs can dominate even at modest sequence lengths; Mixture-of-Experts means "active" parameters ≠ total parameters; diverse parallelism schemes complicate per-GPU counts; KV caching shortens the effective D seen per forward pass; continuous asynchronous batching makes MFU highly dynamic; and tricks like speculative decoding inflate flop counts and can produce apparent MFU above 100% if wasted work isn't accounted for.

To address this, the vLLM PR offers two ways to measure MFU. A fast, pragmatic method computes active parameter counts (scaling MoE layers by an experts-per-token sparsity factor) to get a quick per-step MFU. A more thorough approach walks torch.compile execution graphs to enumerate per-node flops and bytes read/written, enabling per-op roofline analysis that decides whether each op is compute- or memory-bound. The takeaway: MFU is a useful monotonic signal, but it must be measured per module and per operation and combined with roofline-style bandwidth checks to meaningfully guide optimization on modern AI stacks.
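To make the "fast" approach concrete, here is a minimal sketch of a per-step MFU estimate built on active parameter counts, with MoE layers scaled by an experts-per-token sparsity factor and the 2·N (inference) / 6·N (training) flops-per-token heuristic. All names and numbers below are illustrative assumptions, not the vLLM PR's actual API.

```python
def active_params(dense_params: float, moe_params: float,
                  experts_per_token: int, total_experts: int) -> float:
    """Parameters that actually participate in one token's forward pass."""
    return dense_params + moe_params * (experts_per_token / total_experts)


def estimate_mfu(tokens_per_step: int, step_time_s: float,
                 n_active: float, peak_flops_per_s: float,
                 training: bool = False) -> float:
    """Quick MFU: heuristic flops per token, divided by the hardware peak."""
    flops_per_token = (6 if training else 2) * n_active  # 6ND vs 2ND rule
    achieved_flops_per_s = flops_per_token * tokens_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical MoE model: 2B dense params, 14B expert params, 2-of-8 routing,
# processing 4096 tokens in a 200 ms step on a ~1e15 flop/s accelerator.
n_active = active_params(2e9, 14e9, experts_per_token=2, total_experts=8)
print(f"MFU ≈ {estimate_mfu(4096, 0.200, n_active, 1e15):.1%}")
```

Note how routing only 2 of 8 experts drops the "active" count to 5.5B of the 16B total, which is exactly the adjustment that keeps MoE models from appearing artificially under-utilized.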
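The per-op roofline check can also be sketched in a few lines: given each node's flop count and bytes moved (which the PR derives from torch.compile execution graphs), compare its arithmetic intensity to the hardware ridge point. The peak numbers and example ops here are assumptions for illustration only.

```python
def roofline(flops: float, bytes_moved: float,
             peak_flops_per_s: float, peak_bytes_per_s: float):
    """Classify an op as compute- or memory-bound and return its roofline cap."""
    intensity = flops / bytes_moved                      # flop/byte
    ridge = peak_flops_per_s / peak_bytes_per_s          # hardware ridge point
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return bound, attainable


# Hypothetical H100-class peaks: ~1e15 flop/s dense BF16, ~3.35e12 B/s HBM.
PEAK_FLOPS, PEAK_BW = 1e15, 3.35e12
ops = {
    "prefill_gemm":     (8.8e12, 2.2e9),  # large matmul: high intensity
    "decode_attention": (1.7e9, 3.4e9),   # KV-cache reads dominate
}
for name, (f, b) in ops.items():
    bound, cap = roofline(f, b, PEAK_FLOPS, PEAK_BW)
    print(f"{name}: {f / b:.1f} flop/byte -> {bound}, cap {cap / 1e12:.1f} Tflop/s")
```

This is why the article's takeaway pairs MFU with bandwidth checks: a memory-bound op like decode attention can sit at a tiny fraction of peak flops while still running as fast as the hardware allows.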