🤖 AI Summary
Inference engines for generative LLMs face a core scheduling problem: GPUs prefer large, batched work, but user requests arrive over time and grow token by token. Autoregressive models enable “continuous batching” (re-scheduling after every generated token), while non-autoregressive models must batch at the request level. Early fixed-rectangle batching wasted VRAM; paged attention and finer-grained KV allocation let systems pack many more sequences into memory, but now the scheduler must decide which requests to admit, how many tokens each request should advance per forward pass, and which sequences to preempt when VRAM (and thus the token budget per forward pass) is exhausted.
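To make the memory argument concrete, here is a back-of-the-envelope sketch (block size and sequence lengths are illustrative, not taken from the article) contrasting fixed-rectangle reservation with paged, on-demand KV-block allocation:

```python
import math

BLOCK_SIZE = 16  # tokens per KV block; illustrative, the real block size is configurable

def blocks_needed(num_tokens: int) -> int:
    """KV blocks required to hold `num_tokens` tokens of cache."""
    return math.ceil(num_tokens / BLOCK_SIZE)

# Fixed-rectangle batching: every sequence reserves space for the maximum length up front.
max_len = 2048
rectangle_cost = blocks_needed(max_len)             # 128 blocks per sequence, mostly unused

# Paged allocation: blocks are granted on demand as the sequence actually grows.
prompt_len, generated = 180, 37
paged_cost = blocks_needed(prompt_len + generated)  # 14 blocks actually in use

print(rectangle_cost, paged_cost)  # 128 vs 14 -> far more sequences fit in the same VRAM
```

The flip side of this packing is that allocation can now fail mid-generation, which is exactly why the scheduler needs a preemption policy.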
vLLM’s scheduler exemplifies these design choices: a SchedulerInterface that, on each iteration, outputs a map {req_id: num_tokens} under a global token_budget and supports “chunked prefill” (mixing prompt prefills and single-token decodes in one heterogeneous batch). Scheduling reconsiders running requests first, with long prefills chunked by long_prefill_token_threshold, asks the kv_cache_manager to allocate blocks for their growth, and preempts when allocation fails, using either priority-based eviction (drop the lowest-priority, oldest request) or FCFS (evict the most recently admitted one). Waiting requests are then considered if space remains, subject to max_num_running_reqs, prefix caching, remote KV transfers, and lookahead/speculative-decoding constraints. The result is a rich set of trade-offs between memory packing, compute saturation, latency, fairness, and throughput, making scheduler policy and KV-block management central levers for scalable, low-latency LLM serving.
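To make that flow concrete, here is a minimal single-iteration sketch. It is not vLLM’s implementation: Req, SchedulerSketch, can_allocate, and the priority convention are invented for illustration; only token_budget, long_prefill_token_threshold, and max_num_running_reqs mirror names from the summary above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Req:
    req_id: str
    priority: int          # smaller = more important (illustrative convention)
    arrival: float         # arrival timestamp, used to break priority ties
    prompt_len: int
    computed: int = 0      # prompt tokens whose KV cache is already materialized

@dataclass
class SchedulerSketch:
    """One scheduling iteration, loosely following the flow described above."""
    token_budget: int                  # max tokens across the whole batch per forward pass
    long_prefill_token_threshold: int  # per-request cap on prefill chunk size
    max_num_running_reqs: int
    use_priority: bool = True          # False -> FCFS-style preemption
    running: List[Req] = field(default_factory=list)
    waiting: List[Req] = field(default_factory=list)

    def can_allocate(self, req: Req, n: int) -> bool:
        # Stand-in for asking the kv_cache_manager for blocks; real calls can fail under VRAM pressure.
        return True

    def preempt(self) -> None:
        """Evict one running request, freeing its KV blocks, and push it back to waiting."""
        if self.use_priority:
            # Priority policy: drop the lowest-priority request, oldest on ties.
            victim = max(self.running, key=lambda r: (r.priority, -r.arrival))
        else:
            # FCFS policy: drop the most recently admitted running request.
            victim = self.running[-1]
        self.running.remove(victim)
        victim.computed = 0            # its cache is gone; it must re-prefill when readmitted
        self.waiting.insert(0, victim)

    def step(self) -> Dict[str, int]:
        budget = self.token_budget
        out: Dict[str, int] = {}

        # 1) Reconsider running requests first: unfinished prefills advance by a chunk
        #    (capped by long_prefill_token_threshold), finished ones decode one token.
        for req in list(self.running):
            if budget == 0 or req not in self.running:
                continue               # budget spent, or req was preempted earlier in this pass
            remaining = req.prompt_len - req.computed
            n = min(remaining, self.long_prefill_token_threshold) if remaining > 0 else 1
            n = min(n, budget)
            # If blocks cannot be allocated for this growth, preempt until they can
            # (possibly preempting the request itself, which then drops out of the batch).
            while not self.can_allocate(req, n) and req in self.running:
                self.preempt()
            if req not in self.running:
                continue
            out[req.req_id] = n
            budget -= n

        # 2) Admit waiting requests with the leftover budget, up to max_num_running_reqs.
        while self.waiting and budget > 0 and len(self.running) < self.max_num_running_reqs:
            req = self.waiting[0]
            n = min(req.prompt_len - req.computed, self.long_prefill_token_threshold, budget)
            if n <= 0 or not self.can_allocate(req, n):
                break                  # nothing schedulable; leave the queue intact
            self.waiting.pop(0)
            self.running.append(req)
            out[req.req_id] = n
            budget -= n

        return out

# A long prompt gets chunked while a short one is admitted whole in the same batch.
sched = SchedulerSketch(token_budget=2048, long_prefill_token_threshold=512, max_num_running_reqs=8)
sched.waiting = [Req("a", priority=0, arrival=0.0, prompt_len=1300),
                 Req("b", priority=1, arrival=0.1, prompt_len=40)]
print(sched.step())   # {'a': 512, 'b': 40}
```

The sketch omits prefix caching, remote KV transfers, and speculative-decoding lookahead, all of which further constrain how many tokens each request may claim in a real iteration.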