LLM Engine Orchestration for Performance (www.anyscale.com)

🤖 AI Summary
Ray Serve (Ray 2.49) now supports custom request routing and ships a prefix-aware router, PrefixCacheAffinityRouter, designed to optimize LLM inference. Instead of the default "Power of Two Choices" load balancer, the new router maintains a lightweight, character-level prefix tree that approximates each replica's vLLM KV prefix cache and routes each new request to the replica with the longest common prefix. This preserves hot prefix-cache hits, which are critical for multi-turn chat, agent/tool loops, and large MoE models using DP+EP sharding, so requests avoid repeated prefill work. The custom router API is currently alpha/experimental, and the router falls back to power-of-two routing when the prefix match is short or replica load is imbalanced.

Benchmarks show substantial gains: on a 32B DeepSeek-R1 model across 64 GPUs, PrefixCacheAffinityRouter delivered roughly a 60% reduction in time-to-first-token and more than a 40% end-to-end throughput improvement (and up to ~2.5x input-token processing gains in some tests). The team used vLLM's PrefixRepetitionDataset to control how often prefixes are shared (e.g., 512-token prefixes plus 128-token unique suffixes), scaled concurrency with the replica count, and published reproduction scripts. The design trades some precision for low overhead (a character-level approximation rather than explicit KV-cache events) but yields large practical latency and efficiency wins for real-world LLM serving scenarios.
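To make the routing idea concrete, here is a minimal, self-contained Python sketch of the decision logic described above. It is not Ray Serve's actual API: `Replica`, `choose_replica`, `min_match_chars`, and `imbalance_factor` are hypothetical names, and the per-replica list of cached prompts is a crude stand-in for the character-level prefix tree the real router maintains.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Replica:
    """Stand-in for a Ray Serve replica handle (hypothetical)."""
    name: str
    in_flight: int = 0  # requests currently queued or running on this replica
    cached_prompts: list[str] = field(default_factory=list)  # rough proxy for the KV prefix cache

def _longest_common_prefix(a: str, b: str) -> int:
    """Length of the shared character-level prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def choose_replica(prompt: str, replicas: list[Replica],
                   min_match_chars: int = 32,
                   imbalance_factor: float = 2.0) -> Replica:
    """Prefix-affinity routing with a power-of-two-choices fallback.

    1. Score each replica by the longest character-level prefix it shares
       with the incoming prompt (approximating vLLM's KV prefix cache).
    2. If the best match is long enough and that replica is not badly
       overloaded relative to the least-loaded replica, route to it.
    3. Otherwise fall back to "power of two choices": sample two replicas
       at random and pick the one with fewer in-flight requests.
    """
    best, best_match = None, 0
    for r in replicas:
        match = max((_longest_common_prefix(prompt, p) for p in r.cached_prompts), default=0)
        if match > best_match:
            best, best_match = r, match

    min_load = min(r.in_flight for r in replicas)
    if best is not None and best_match >= min_match_chars \
            and best.in_flight <= imbalance_factor * max(min_load, 1):
        chosen = best
    else:
        a, b = random.sample(replicas, 2)
        chosen = a if a.in_flight <= b.in_flight else b

    chosen.in_flight += 1
    chosen.cached_prompts.append(prompt)  # model the new prefix entering that replica's cache
    return chosen
```

In the real router, the per-replica state is an approximate prefix tree rather than a list of strings, and the routing policy plugs into Ray Serve's alpha custom request-router interface; the sketch only illustrates the longest-common-prefix scoring and the load-aware fallback.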