P99Conf 2025 Recap: Rust, LLMs, and the Art of Shaving Down Latency (tanatloke.medium.com)

🤖 AI Summary
P99Conf 2025 was a deep dive into shaving latency across the stack. The dominant themes were a wave of Rust rewrites of core systems, an intense focus on LLM inference optimization, database and storage internals, and low-overhead observability with eBPF. Several talks framed LLM performance around two metrics: Time to First Token (TTFT, responsiveness) and Time Per Output Token (TPOT, generation throughput). Practical levers included quantization and distillation to shrink models, batching and prefill/decode decoupling to separate the parallel prefill phase from the sequential decode phase, and prompt caching to reuse shared token state. KV-cache offloading was a standout: store prefill KV matrices on disk (cheap, linear I/O) rather than recomputing them (quadratic compute), backed by cache structures such as radix trees and eviction policies that prefer keeping shared system-prompt blocks over ephemeral leaf answers.

On the systems-engineering side, Rust migrations delivered dramatic gains. Datadog's Rust timeseries engine reported 60× faster ingestion, 5× faster queries, and 2× better cost efficiency, built on a per-shard single-threaded LSM design with unified caching. Real-world memory fixes included Uber's hybrid disk format, which keeps keys and offsets in RAM alongside an LRU cache of hot values to avoid the OOMs caused by atomically reloading everything into memory.

Pipeline and edge talks highlighted new serialization approaches (Imprint) and zero-copy formats (rkyv) that eliminate deserialization costs (Climatiq cut its P99 to under 10 ms), while Go tuning advice stressed an observability-first approach and knobs such as GOMAXPROCS, GOGC, GOMEMLIMIT, and profile-guided optimization (PGO). Together these sessions laid out a practical playbook for anyone building low-latency ML services and high-throughput backends. A few illustrative sketches of these ideas follow below.
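To make the two LLM metrics concrete, here is a back-of-the-envelope sketch of how TTFT and TPOT compose into the latency a user sees for a streamed completion. The helper and the numbers are invented for illustration; they are not taken from any of the talks.

```rust
// Rough end-to-end latency model for a streamed LLM response:
// total ≈ TTFT (prefill + first token) + (N - 1) * TPOT (each further decode step).
// Numbers below are illustrative only.
fn estimated_latency_ms(ttft_ms: f64, tpot_ms: f64, output_tokens: u32) -> f64 {
    ttft_ms + tpot_ms * output_tokens.saturating_sub(1) as f64
}

fn main() {
    // e.g. 300 ms to the first token, 25 ms per subsequent token, a 200-token answer
    let total = estimated_latency_ms(300.0, 25.0, 200);
    println!("estimated completion latency: {total} ms"); // ~5275 ms
}
```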
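The prompt-caching and KV-offloading discussion is easiest to picture as a prefix tree over token blocks: shared prefixes (system prompts, few-shot examples) sit near the root and are reused by every request, while per-request answers end up in leaves that are cheap to evict. The sketch below is a toy model of that idea, not the data structure of any particular serving framework, and the KV "handle" is only a placeholder for blocks written to disk during prefill.

```rust
use std::collections::HashMap;

// Toy prefix tree over token-ID blocks. Interior nodes hold KV state for shared
// prefixes; leaves hold per-request answers. Conceptual sketch only.
struct Node {
    children: HashMap<Vec<u32>, Node>, // next token block -> subtree
    kv_handle: Option<u64>,            // pretend handle to prefill KV stored on disk
    last_used: u64,                    // logical clock for LRU-style eviction
}

impl Node {
    fn new() -> Self {
        Node { children: HashMap::new(), kv_handle: None, last_used: 0 }
    }
}

struct PrefixCache {
    root: Node,
    clock: u64,
}

impl PrefixCache {
    fn new() -> Self {
        PrefixCache { root: Node::new(), clock: 0 }
    }

    // Walk the tree along the prompt's token blocks, reusing cached KV where the
    // prefix matches and inserting fresh nodes where it diverges.
    fn lookup_or_insert(&mut self, blocks: &[Vec<u32>]) -> usize {
        self.clock += 1;
        let clock = self.clock;
        let mut node = &mut self.root;
        let mut reused = 0;
        for block in blocks {
            let existed = node.children.contains_key(block);
            node = node.children.entry(block.clone()).or_insert_with(Node::new);
            node.last_used = clock;
            if existed && node.kv_handle.is_some() {
                reused += 1; // load this block's prefill KV instead of recomputing it
            } else {
                node.kv_handle = Some(clock); // stand-in for "write prefill KV to disk"
            }
        }
        reused
    }

    // Evict leaves only: interior nodes are shared prefixes worth keeping,
    // while leaf answers are ephemeral.
    fn evict_oldest_leaf(&mut self) -> bool {
        fn evict(node: &mut Node) -> bool {
            // Find the least-recently-used child that is itself a leaf.
            let mut victim: Option<Vec<u32>> = None;
            let mut oldest = u64::MAX;
            for (key, child) in &node.children {
                if child.children.is_empty() && child.last_used < oldest {
                    oldest = child.last_used;
                    victim = Some(key.clone());
                }
            }
            if let Some(key) = victim {
                node.children.remove(&key);
                return true;
            }
            // Otherwise recurse until a subtree with leaves is found.
            node.children.values_mut().any(evict)
        }
        evict(&mut self.root)
    }
}

fn main() {
    let mut cache = PrefixCache::new();
    let system_prompt = vec![1, 2, 3, 4];
    // Two requests sharing the same system prompt reuse its cached block.
    let reused_a = cache.lookup_or_insert(&[system_prompt.clone(), vec![10, 11]]);
    let reused_b = cache.lookup_or_insert(&[system_prompt.clone(), vec![20, 21]]);
    println!("request A reused {reused_a} block(s), request B reused {reused_b}");
    cache.evict_oldest_leaf(); // drops an old answer leaf, keeps the shared prefix
}
```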
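Uber's hybrid layout is essentially "small index in RAM, values on disk, hot values cached". Below is a minimal sketch of that shape under two simplifications: the on-disk value log is replaced by an in-memory Vec<u8> stand-in, and the cache evicts by insertion order rather than true recency. All names here are invented for the example.

```rust
use std::collections::{HashMap, VecDeque};

// Keys + offsets stay in RAM; full values live in an append-only log (a real
// file in practice, a Vec<u8> here); only recently read values are cached.
struct HybridStore {
    index: HashMap<String, (usize, usize)>, // key -> (offset, len) in the value log
    value_log: Vec<u8>,                     // stand-in for the on-disk value file
    hot_cache: HashMap<String, Vec<u8>>,    // small cache of recently read values
    recency: VecDeque<String>,              // simplified eviction order (no bump on hit)
    cache_capacity: usize,
}

impl HybridStore {
    fn new(cache_capacity: usize) -> Self {
        HybridStore {
            index: HashMap::new(),
            value_log: Vec::new(),
            hot_cache: HashMap::new(),
            recency: VecDeque::new(),
            cache_capacity,
        }
    }

    fn put(&mut self, key: &str, value: &[u8]) {
        let offset = self.value_log.len();
        self.value_log.extend_from_slice(value); // append-only value log
        self.index.insert(key.to_string(), (offset, value.len()));
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.hot_cache.get(key) {
            return Some(v.clone()); // hot path: served from RAM
        }
        let &(offset, len) = self.index.get(key)?;
        // Cold path: one positioned read from the value log instead of keeping
        // every value resident (the pattern behind the OOM-prone full reloads).
        let value = self.value_log[offset..offset + len].to_vec();
        self.cache_insert(key.to_string(), value.clone());
        Some(value)
    }

    fn cache_insert(&mut self, key: String, value: Vec<u8>) {
        if self.hot_cache.len() >= self.cache_capacity {
            if let Some(old) = self.recency.pop_front() {
                self.hot_cache.remove(&old); // evict the oldest cached value
            }
        }
        self.recency.push_back(key.clone());
        self.hot_cache.insert(key, value);
    }
}

fn main() {
    let mut store = HybridStore::new(2);
    store.put("user:1", b"alice");
    store.put("user:2", b"bob");
    store.put("user:3", b"carol");
    assert_eq!(store.get("user:1").as_deref(), Some(&b"alice"[..]));
    assert_eq!(store.get("user:3").as_deref(), Some(&b"carol"[..]));
    println!("index entries in RAM: {}, cached values: {}",
             store.index.len(), store.hot_cache.len());
}
```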
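Finally, the zero-copy point: formats like rkyv arrange data so a reader can interpret bytes in place rather than parsing them into freshly allocated objects. The example below illustrates that idea with a hand-rolled fixed layout; it is not rkyv's actual derive-based API, and the layout is made up for the example.

```rust
// Layout: [id: u64 little-endian][name_len: u16 little-endian][name bytes...]
fn write_record(id: u64, name: &str) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&id.to_le_bytes());
    buf.extend_from_slice(&(name.len() as u16).to_le_bytes());
    buf.extend_from_slice(name.as_bytes());
    buf
}

// A "view" that borrows from the buffer: no allocation, no copy of the name.
struct RecordView<'a> {
    id: u64,
    name: &'a str,
}

fn read_record(buf: &[u8]) -> Option<RecordView<'_>> {
    let id = u64::from_le_bytes(buf.get(0..8)?.try_into().ok()?);
    let name_len = u16::from_le_bytes(buf.get(8..10)?.try_into().ok()?) as usize;
    let name = std::str::from_utf8(buf.get(10..10 + name_len)?).ok()?;
    Some(RecordView { id, name })
}

fn main() {
    // In a pipeline this buffer would arrive over the network or be mmap'd from
    // disk; the consumer reads it in place, which is what removes the
    // deserialization cost from the hot path.
    let bytes = write_record(42, "emission-factor");
    let view = read_record(&bytes).expect("well-formed record");
    println!("id={} name={}", view.id, view.name);
}
```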