Show HN: KV Marketplace – share LLM attention caches across GPUs like memcached (github.com)

🤖 AI Summary
KV Marketplace is a new node-local runtime that treats transformer attention key/value (KV) caches as shareable artifacts, so different GPU processes can reuse completed prefix states instead of recomputing them. It integrates as plugin hooks (before_prefill, after_prefill) in a vLLM fork (neelsomani/vllm, branch vllm-kvm-dev): after prefill, per-layer KV tensors are exported into a registry, indexed by a hash of the token sequence plus model version; before prefill, matching prefixes are imported via direct GPU-to-GPU transfers (CUDA peer-to-peer / NVLink / RDMA), bypassing host memory. The code includes a CUDA transport extension, demo scripts, unit and integration tests (including P2P copies and numeric-fidelity checks), and example vLLM benchmarks showing improved throughput and latency in common multi-request scenarios.

For ML infra and model-serving teams this is significant because autoregressive decoding frequently repeats prefix computation across tenants or user queries; sharing KV caches can substantially reduce redundant compute and GPU memory use for chat, RAG, and multi-tenant serving.

Key technical notes and current limits: exact-match prefix reuse only (no longest-common-prefix partial reuse yet), node-local only (no cross-host registry), no sharded tensor import, no compression/quantization, and no advanced eviction/placement policies in the MVP. Requirements include CUDA 12.8, PyTorch 2.0+, and GPUs that support peer-to-peer access; future work could add distributed registries, LCP matching, cache-aware load balancing, and consistency models.
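A minimal sketch of the export/import flow described above, assuming hypothetical names (KVRegistry, cache_key, export, and import_to are illustrative, not the repo's actual API): entries are keyed by an exact hash of the token prefix plus model version, and an import copies tensors directly between CUDA devices, which PyTorch performs peer-to-peer when the GPUs support it.

```python
import hashlib

import torch


class KVRegistry:
    """Node-local registry mapping an exact prefix hash to per-layer (K, V) tensors.

    Illustrative sketch only; names and signatures here are assumptions,
    not the project's actual API.
    """

    def __init__(self) -> None:
        self._entries: dict[str, list[tuple[torch.Tensor, torch.Tensor]]] = {}

    @staticmethod
    def cache_key(token_ids: list[int], model_version: str) -> str:
        # Exact-match key over the full token prefix plus model version
        # (no longest-common-prefix matching in the MVP).
        h = hashlib.sha256()
        h.update(model_version.encode())
        h.update(",".join(map(str, token_ids)).encode())
        return h.hexdigest()

    def export(self, key: str, layers: list[tuple[torch.Tensor, torch.Tensor]]) -> None:
        # after_prefill: publish the per-layer KV tensors just computed on this GPU.
        self._entries[key] = layers

    def import_to(self, key: str, device: torch.device):
        # before_prefill: on an exact hit, copy each layer's K/V to the requesting GPU.
        layers = self._entries.get(key)
        if layers is None:
            return None
        # A device-to-device .to() copy goes peer-to-peer when P2P access is
        # enabled between the GPUs, so the tensors never bounce through host memory.
        return [
            (k.to(device, non_blocking=True), v.to(device, non_blocking=True))
            for k, v in layers
        ]


# Hypothetical usage from the two hooks:
registry = KVRegistry()
prompt_token_ids = [1, 2, 3, 4]                        # assumed example prompt
key = KVRegistry.cache_key(prompt_token_ids, "my-model@v1")

hit = registry.import_to(key, torch.device("cuda:1"))  # before_prefill
if hit is None:
    # ... run the normal prefill, collect per-layer (K, V) tensors,
    # then registry.export(key, computed_layers)        # after_prefill
    pass
```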