🤖 AI Summary
Meta and the vLLM team announced a production-grade implementation of prefill/decode (P/D) disaggregation, integrated with PyTorch and vLLM, to scale generative inference. P/D disaggregation splits the compute-heavy "prefill" pass (run once per request to produce the first token) from the memory-bandwidth-bound autoregressive "decode" pass (which dominates latency), letting prefill and decode hosts scale independently. Meta's implementation, comprising a service proxy, an async Python KV connector, and high-performance C++ KV connectors over TCP, is already serving large internal traffic and will feed optimizations back upstream to vLLM.
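A minimal, self-contained sketch of that request flow, assuming hypothetical class names (`PrefillWorker`, `DecodeWorker`, `DisaggProxy`) rather than the actual Meta/vLLM service APIs; the real system ships per-layer KV tensors over multi-stream TCP, which is stubbed here as a bytes payload:

```python
# Toy sketch of disaggregated serving (hypothetical names, not vLLM's API):
# a proxy routes each request to a prefill worker, hands the resulting KV
# cache to a decode worker, and returns the combined token stream.
from dataclasses import dataclass


@dataclass
class PrefillResult:
    first_token: str
    kv_cache: bytes  # stand-in for the per-layer KV tensors shipped over TCP


class PrefillWorker:
    """Compute-heavy pass over the full prompt; runs once per request."""

    def run(self, prompt: str) -> PrefillResult:
        return PrefillResult(first_token="<t0>", kv_cache=prompt.encode())


class DecodeWorker:
    """Memory-bandwidth-bound autoregressive loop; dominates request latency."""

    def run(self, kv_cache: bytes, max_new_tokens: int) -> list[str]:
        return [f"<t{i}>" for i in range(1, max_new_tokens + 1)]


class DisaggProxy:
    """Service proxy: picks a prefill host and a decode host independently,
    so the two pools can be scaled on their own."""

    def __init__(self, prefill: PrefillWorker, decode: DecodeWorker):
        self.prefill, self.decode = prefill, decode

    def handle(self, prompt: str, max_new_tokens: int) -> list[str]:
        result = self.prefill.run(prompt)            # 1. prefill pass
        kv = result.kv_cache                         # 2. KV transfer (TCP in prod)
        rest = self.decode.run(kv, max_new_tokens)   # 3. decode pass
        return [result.first_token, *rest]


print(DisaggProxy(PrefillWorker(), DecodeWorker()).handle("Hello, world", 3))
```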
Technically, the stack focuses on parallel, low-latency KV-cache transfer: multi-NIC and multi-stream TCP transfers, sticky routing for session affinity (yielding a ~40–50% prefix-cache hit rate while keeping HBM ~90% utilized), larger KV block sizes (128–256 tokens vs. vLLM's default of 16) to reduce kernel overhead, non-blocking CUDA streams, per-layer sequential KV injection, and avoiding heavyweight Python objects during scheduling. Benchmarks with Llama 4 Maverick on 8x H100 hosts (2,000-token input / 150-token output) show that 1P1D disaggregation improves throughput at fixed batch sizes and produces smoother TTIT under fixed QPS, though time-to-first-token (TTFT) can regress at extreme load (network bottlenecks, prefill pressure). Next steps include cache-miss-only transfers and further compute/communication overlap to tighten TTFT/TTIT tradeoffs.
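To make two of the listed optimizations concrete, here is a minimal PyTorch sketch of injecting received KV blocks layer by layer on a dedicated non-blocking CUDA stream, so host-to-device copies can overlap with decode compute on the default stream; layer count, block size, and tensor shapes are illustrative assumptions, not vLLM's internal connector code:

```python
# Hypothetical per-layer KV injection on a side CUDA stream (not vLLM internals).
import torch

NUM_LAYERS = 4            # hypothetical model depth
BLOCK_TOKENS = 128        # larger blocks (128-256) mean fewer, bigger copies than the 16-token default
NUM_HEADS, HEAD_DIM = 8, 128


def inject_kv_per_layer(recv_buffers, kv_cache, copy_stream):
    """Copy each layer's received KV block into the GPU KV cache on a
    dedicated stream, so transfers overlap with decode kernels that keep
    running on the default stream."""
    copy_stream.wait_stream(torch.cuda.current_stream())  # respect prior work on the cache
    with torch.cuda.stream(copy_stream):
        for layer_idx, host_block in enumerate(recv_buffers):
            # Pinned host buffer -> asynchronous H2D copy into this layer's slot.
            kv_cache[layer_idx].copy_(host_block, non_blocking=True)
    # Decode must not read the cache before the copies have landed.
    torch.cuda.current_stream().wait_stream(copy_stream)


if torch.cuda.is_available():
    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream(device=device)
    # One (K, V) block per layer: [2, block_tokens, num_heads, head_dim].
    shape = (2, BLOCK_TOKENS, NUM_HEADS, HEAD_DIM)
    recv_buffers = [torch.randn(shape, pin_memory=True) for _ in range(NUM_LAYERS)]
    kv_cache = [torch.empty(shape, device=device) for _ in range(NUM_LAYERS)]
    inject_kv_per_layer(recv_buffers, kv_cache, copy_stream)
    torch.cuda.synchronize()
    print(f"Injected {NUM_LAYERS} layers in {BLOCK_TOKENS}-token blocks")
else:
    print("CUDA not available; this sketch needs a GPU to run.")
```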