🤖 AI Summary
Together AI announced ATLAS (AdapTive-LeArning Speculator System), a runtime-adaptive speculative-decoding system in its Together Turbo inference suite that automatically learns from live traffic to boost LLM throughput without manual tuning. ATLAS combines a heavyweight static speculator (broad, stable coverage) with a lightweight adaptive speculator that updates from real-time usage, plus a confidence-aware controller that selects which speculator to use and how many tokens to draft. In benchmarks ATLAS reaches up to ~500 TPS on DeepSeek‑V3.1 and ~460 TPS on Kimi‑K2 (a 4× speedup over an FP8 baseline and ~2.65× over standard decoding), even outperforming specialized inference hardware such as Groq in Together's reported tests.
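To make the routing concrete, here is a minimal Python sketch of a confidence-aware controller; the names, threshold, and draft-length heuristic are illustrative assumptions, not Together's implementation:

```python
from dataclasses import dataclass

@dataclass
class Speculator:
    """Hypothetical stand-in for a draft model."""
    name: str
    max_draft_len: int

def choose_draft(confidence: float,
                 static: Speculator,
                 adaptive: Speculator,
                 threshold: float = 0.7) -> tuple[Speculator, int]:
    """Route between speculators based on recent acceptance confidence.

    High confidence -> use the workload-tuned adaptive speculator and
    draft aggressively; low confidence (e.g. suspected distribution
    drift) -> fall back to the broad static speculator with a short,
    conservative draft. All numbers are illustrative, not ATLAS's policy.
    """
    if confidence >= threshold:
        n_tokens = min(adaptive.max_draft_len, max(1, int(confidence * 8)))
        return adaptive, n_tokens
    return static, 2  # guardrail: short, safe drafts from the static model

static = Speculator("static", max_draft_len=4)
adaptive = Speculator("adaptive", max_draft_len=8)
print(choose_draft(0.92, static, adaptive))  # adaptive speculator, 7 tokens
print(choose_draft(0.40, static, adaptive))  # static speculator, 2 tokens
```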
Technically, ATLAS advances speculative decoding by addressing the two knobs that determine speed: the acceptance rate (α) and the draft-to-target latency ratio (c). The system raises α by continuously aligning the lightweight draft model to evolving workloads, and keeps c low with optimized speculator architectures (sparsity, quantization, KV reuse) and fast kernels. A built-in efficiency guardrail falls back to the static speculator when low confidence or distribution drift is detected, preventing TPS collapse. The approach is especially valuable in serverless/multi-tenant settings and in RL training (where rollouts are a major time sink and the evolving policy shifts the output distribution), since online adaptation preserves alignment and compounds with Turbo's other optimizations for sustained throughput gains.
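The interplay of α and c can be made concrete with the standard speedup model from the speculative-decoding literature (Leviathan et al.'s analysis; the general formula, not an ATLAS-specific one). With draft length γ, a verification round yields (1 − α^(γ+1))/(1 − α) accepted tokens on average at a cost of γc + 1 target-model steps:

```python
def expected_speedup(alpha: float, c: float, gamma: int) -> float:
    """Expected speedup of speculative decoding over standard decoding
    (standard analysis; illustrative, not Together's internal model).

    alpha: per-token draft acceptance rate, 0 <= alpha < 1
    c:     draft-to-target latency ratio (draft step / target step)
    gamma: tokens drafted per verification round
    """
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_round = gamma * c + 1  # gamma draft steps + 1 verify step
    return expected_tokens / cost_per_round

# Raising alpha (better-aligned drafts) and lowering c (leaner, faster
# speculators) compound, which is why ATLAS targets both:
print(f"{expected_speedup(alpha=0.7, c=0.20, gamma=4):.2f}x")  # ~1.54x
print(f"{expected_speedup(alpha=0.9, c=0.05, gamma=4):.2f}x")  # ~3.41x
```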