LLM Inference with Ray: Expert parallelism and prefill/decode disaggregation (www.anyscale.com)

🤖 AI Summary
Ray announced new Ray Serve LLM APIs that make it much easier to deploy advanced serving patterns for sparse mixture-of-experts (MoE) models such as DeepSeek and Qwen3. The release adds first-class support for wide expert parallelism (wide-EP) and disaggregated prefill/decode topologies, validated at high throughput (≈2.4k tokens/s per H200 on Nebius with InfiniBand). Example builders (e.g., build_dp_deployment, build_pd_openai_app) and an LLMConfig object let users compose data-parallel plus expert-parallel groups and independent prefill/decode engines in plain Python, with vLLM optimizations enabled via environment flags (VLLM_USE_DEEP_GEMM, VLLM_ALL2ALL_BACKEND=deepep_low_latency).

Technically, wide-EP shards MoE experts across GPUs while duplicating attention layers, and adds expert load balancing, expert replication, and optimized all-to-all kernels; the builder handles DP rank assignment, placement, and group formation (DPRankAssigner → DPServer → vLLM).

Prefill/decode disaggregation separates prompt encoding (prefill) from token generation (decode) using a KV transfer connector (e.g., NixlConnector) and a PDProxyServer: the prefill engine fills the KV cache (running with max_tokens=1) and hands KV-transfer metadata to the decode pool, while a PrefixCacheAffinityRouter improves cache hit rates.

The significance: Ray makes tightly coupled, stateful, topology-aware serving programmable rather than merely configurable. That reduces operational complexity versus vanilla Kubernetes and enables heterogeneous autoscaling, better SLAs, observability for KV transfer and speculative decoding, and vLLM-equivalent performance with dynamic, fault-tolerant orchestration.
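The summary names LLMConfig and the build_pd_openai_app builder; a hedged sketch of what composing a prefill/decode app might look like follows. Treat this as a config sketch only: the import path, the field names (model_loading_config, engine_kwargs, data_parallel_size, prefill_config, decode_config), and the builder's argument shape are assumptions about the current ray.serve.llm layout and may differ by Ray version.

```python
# Sketch only: LLMConfig and build_pd_openai_app are named in the post, but
# every field and signature below is an assumption, not a confirmed API.
from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app  # path assumed

prefill_config = LLMConfig(
    model_loading_config=dict(model_id="deepseek-ai/DeepSeek-V3"),
    engine_kwargs=dict(
        enable_expert_parallel=True,  # wide-EP: shard MoE experts across GPUs
        data_parallel_size=8,         # assumed knob for the DP group width
    ),
    runtime_env=dict(env_vars={
        # vLLM optimizations mentioned in the post, enabled via env flags:
        "VLLM_USE_DEEP_GEMM": "1",
        "VLLM_ALL2ALL_BACKEND": "deepep_low_latency",
    }),
)
decode_config = LLMConfig(...)  # analogous config, sized for the decode pool

# Compose independent prefill and decode engines behind one
# OpenAI-compatible app; the dict keys here are hypothetical.
app = build_pd_openai_app(dict(prefill_config=prefill_config,
                               decode_config=decode_config))
serve.run(app)
```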
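The DPRankAssigner → DPServer → vLLM pipeline above assigns each data-parallel replica a unique rank before the replicas form one expert-parallel group. A minimal toy model of the rank-assignment step (illustrative only, not Ray's actual code):

```python
# Illustrative stand-in for the DP rank assignment the post describes:
# each replica must receive a dense, unique rank so the group can form.
import itertools


class DPRankAssigner:
    """Hands out dense, unique DP ranks as replicas come up."""

    def __init__(self, dp_size: int):
        self.dp_size = dp_size
        self._counter = itertools.count()  # next unassigned rank

    def assign(self, replica_id: str) -> int:
        rank = next(self._counter)
        if rank >= self.dp_size:
            raise RuntimeError(f"more replicas than dp_size={self.dp_size}")
        return rank


assigner = DPRankAssigner(dp_size=4)
ranks = [assigner.assign(f"replica-{i}") for i in range(4)]
# ranks == [0, 1, 2, 3]; rank 0 typically bootstraps the process group
```

In the real system the assigner must also survive replica restarts and coordinate placement, which is exactly the stateful orchestration the post argues Ray handles better than static configuration.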
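The prefill/decode handoff can be sketched end to end: the proxy sends the prompt to a prefill engine with max_tokens=1 so it only populates the KV cache, then forwards the returned KV metadata to a decode engine. All classes below are illustrative stand-ins for PDProxyServer and the engines, not Ray or vLLM APIs.

```python
# Toy model of disaggregated prefill/decode. A real KV transfer connector
# (e.g. NixlConnector) would move cached KV blocks between engines; here the
# metadata object just records where the blocks would live.
from dataclasses import dataclass


@dataclass
class KVMeta:
    request_id: str
    engine_addr: str       # where decode would pull KV blocks from
    num_prompt_tokens: int


class PrefillEngine:
    def __init__(self, addr: str):
        self.addr = addr

    def run(self, request_id: str, prompt: str, max_tokens: int = 1) -> KVMeta:
        assert max_tokens == 1  # prefill only fills the KV cache
        return KVMeta(request_id, self.addr, len(prompt.split()))


class DecodeEngine:
    def run(self, prompt: str, kv: KVMeta, max_tokens: int) -> str:
        # Real decode fetches KV blocks from kv.engine_addr instead of
        # re-running prefill, then generates tokens autoregressively.
        return (f"<{max_tokens} tokens decoded after "
                f"{kv.num_prompt_tokens}-token prefill>")


class PDProxy:
    """Stand-in for the PDProxyServer: prefill first, then decode."""

    def __init__(self, prefill: PrefillEngine, decode: DecodeEngine):
        self.prefill, self.decode = prefill, decode

    def generate(self, request_id: str, prompt: str, max_tokens: int) -> str:
        kv = self.prefill.run(request_id, prompt, max_tokens=1)
        return self.decode.run(prompt, kv, max_tokens)


proxy = PDProxy(PrefillEngine("10.0.0.5:8000"), DecodeEngine())
out = proxy.generate("req-1", "the quick brown fox", max_tokens=16)
# → "<16 tokens decoded after 4-token prefill>"
```

Splitting the phases this way lets the prefill and decode pools scale and fail independently, which is what makes the topology "disaggregated".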
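The idea behind prefix-affinity routing is that requests sharing a prompt prefix should land on the same replica so its KV prefix cache gets hits. A toy model of that idea, hashing a fixed-length prefix to pick a replica (Ray's PrefixCacheAffinityRouter is more sophisticated; this is only a sketch of the mechanism):

```python
# Illustrative prefix-affinity routing: hash the first N characters of the
# prompt and map the hash to a replica index deterministically.
import hashlib


def route(prompt: str, num_replicas: int, prefix_chars: int = 64) -> int:
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas


# Two requests with the same long system prompt share their first 64 chars,
# so both hash to the same replica and hit its warm prefix cache.
sys_prompt = "You are a helpful assistant. " * 4
a = route(sys_prompt + "Question one?", num_replicas=4)
b = route(sys_prompt + "Question two?", num_replicas=4)
```

A production router would also weigh replica load, since pure hash affinity can hot-spot one replica when a single prefix dominates traffic.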