🤖 AI Summary
Paged Attention is a memory-management scheme for LLM KV caches that packs per-request key/value vectors into fixed-size blocks addressed through an indirection table, instead of reserving one contiguous slab per sequence. That change dramatically reduces internal and external fragmentation, lets the GPU reuse freed blocks immediately, and raises effective batch sizes under heterogeneous workloads, while preserving the fundamental O(T) streaming cost of decoding. The analysis shows why this matters in practice: with d = 4096 and FP16 (2 bytes per element), each layer stores about 16 KB of keys and values per token, so a 1,000-token prompt uses ~16·L_l MB of KV cache per request (L_l being the number of layers), and long contexts quickly exhaust HBM and throttle throughput.
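To make the bookkeeping concrete, here is a minimal Python sketch of paged KV-cache allocation under the assumptions above: a shared pool of fixed-size blocks plus a per-sequence block table standing in for the indirection table. The class and method names (`BlockPool`, `Sequence`, `append_token`) are illustrative, not from the paper or any particular serving engine.

```python
# Minimal sketch of paged KV-cache bookkeeping (illustrative names, not from the paper).
# A shared pool of fixed-size blocks plus a per-sequence block table replaces one
# contiguous slab per request, so freed blocks are immediately reusable by other requests.

D_MODEL = 4096        # hidden size d
BYTES_PER_ELEM = 2    # FP16
KV_BYTES_PER_TOKEN_PER_LAYER = 2 * D_MODEL * BYTES_PER_ELEM  # K and V -> 16 KiB

class BlockPool:
    """Fixed-size KV blocks handed out on demand and recycled on free."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                 # tokens per block (B)
        self.free_list = list(range(num_blocks))     # physical block ids

    def allocate(self) -> int:
        if not self.free_list:
            raise MemoryError("KV cache pool exhausted")
        return self.free_list.pop()

    def free(self, block_id: int) -> None:
        self.free_list.append(block_id)

class Sequence:
    """Per-request block table: logical token position -> physical block."""
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve KV space for one more token; return (physical block, offset)."""
        if self.num_tokens % self.pool.block_size == 0:
            self.block_table.append(self.pool.allocate())   # grow by one block
        offset = self.num_tokens % self.pool.block_size
        self.num_tokens += 1
        return self.block_table[-1], offset

    def release(self) -> None:
        """Return all blocks to the shared pool when the request finishes."""
        for b in self.block_table:
            self.pool.free(b)
        self.block_table.clear()
```

During attention, the kernel would walk the block table to gather the T cached K/V entries from their physical blocks; the last block may be only partially filled, which bounds internal fragmentation by at most B − 1 tokens per sequence.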
The paper gives a simple roofline-style timing model: per decoded token the HBM traffic is ≈ 2 d T s L_l bytes (s being bytes per element), so tokens/sec is bounded by BW/(2 d T s L_l); decode time is modeled as t_decode(T, B) ≈ (2 d T s L_l)/BW + c_blk·(T/B) + c0. The middle term captures the gather/indirection overhead of touching ≈ T/B blocks and motivates a block-size tradeoff: smaller B reduces memory waste and increases batch capacity, while larger B reduces per-block overhead. Minimizing J(B) = αT + β(T/B) + λB over B gives B* ≈ sqrt(βT/λ), meaning the optimal block size grows with context length but shrinks under tight memory pressure (larger λ). In short, paged allocation doesn't change the asymptotic decode cost but makes memory use elastic, enabling higher utilization and steadier latency with modest block-size tuning.
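A short sketch of evaluating this model numerically is given below; the bandwidth, per-block gather cost, and fixed overhead (`bw`, `c_blk`, `c0`, `beta`, `lam`) are illustrative placeholders, not values reported in the paper.

```python
import math

def t_decode(T, B, d=4096, s=2, n_layers=32, bw=2.0e12, c_blk=5e-9, c0=1e-4):
    """Decode-time model from the summary, in seconds per token:
    (2 d T s L_l)/BW + c_blk*(T/B) + c0. Constants here are placeholder guesses."""
    hbm_bytes = 2 * d * T * s * n_layers   # KV bytes read from HBM per decoded token
    return hbm_bytes / bw + c_blk * (T / B) + c0

def optimal_block_size(T, beta, lam):
    """Minimizer of J(B) = alpha*T + beta*(T/B) + lam*B, i.e. B* = sqrt(beta*T/lam)."""
    return math.sqrt(beta * T / lam)

# Longer contexts push B* up; a heavier memory-pressure penalty (lam) pushes it down.
print(round(optimal_block_size(T=4096, beta=1.0, lam=4.0)))   # 32
print(round(optimal_block_size(T=4096, beta=1.0, lam=64.0)))  # 8
```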