🤖 AI Summary
Together AI announced ATLAS (AdapTive-LeArning Speculator System), a runtime-adaptive speculative-decoding system in its Together Turbo inference suite that automatically learns from live traffic to boost LLM throughput without manual tuning. ATLAS combines a heavyweight static speculator (broad, stable coverage) with a lightweight adaptive speculator that updates from real-time usage, plus a confidence-aware controller that selects which speculator to use and how many tokens to draft. In benchmarks ATLAS reaches up to ~500 TPS on DeepSeek‑V3.1 and ~460 TPS on Kimi‑K2 (a 4× speedup over an FP8 baseline and ~2.65× over standard decoding), even outperforming specialized inference hardware such as Groq in Together's reported tests.
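To make the routing concrete, here is a minimal Python sketch of a confidence-aware controller; the names, threshold, and draft-length heuristic are illustrative assumptions, not Together's implementation:

```python
from dataclasses import dataclass

@dataclass
class Speculator:
    """Hypothetical stand-in for a draft model."""
    name: str
    max_draft_len: int

def choose_draft(confidence: float,
                 static: Speculator,
                 adaptive: Speculator,
                 threshold: float = 0.7) -> tuple[Speculator, int]:
    """Route between speculators based on recent acceptance confidence.

    High confidence -> use the workload-tuned adaptive speculator and
    draft aggressively; low confidence (e.g. suspected distribution
    drift) -> fall back to the broad static speculator with a short,
    conservative draft. All numbers are illustrative, not ATLAS's policy.
    """
    if confidence >= threshold:
        n_tokens = min(adaptive.max_draft_len, max(1, int(confidence * 8)))
        return adaptive, n_tokens
    return static, 2  # guardrail: short, safe drafts from the static model

static = Speculator("static", max_draft_len=4)
adaptive = Speculator("adaptive", max_draft_len=8)
print(choose_draft(0.92, static, adaptive))  # adaptive speculator, 7 tokens
print(choose_draft(0.40, static, adaptive))  # static speculator, 2 tokens
```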
Technically, ATLAS advances speculative decoding by addressing the two knobs that determine speed: the acceptance rate (α) and the draft-to-target latency ratio (c). The system raises α by continuously aligning the lightweight draft model to evolving workloads, and keeps c low with optimized speculator architectures (sparsity, quantization, KV reuse) and fast kernels. A built-in efficiency guardrail falls back to the static speculator when low confidence or distribution drift is detected, preventing TPS collapse. The approach is especially valuable in serverless/multi-tenant settings and in RL training (where rollouts are a major time sink and the evolving policy shifts the output distribution), since online adaptation preserves alignment and compounds with Turbo's other optimizations for sustained throughput gains.
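The interplay of α and c can be made concrete with the standard speedup model from the speculative-decoding literature (Leviathan et al.'s analysis; the general formula, not an ATLAS-specific one). With draft length γ, a verification round yields (1 − α^(γ+1))/(1 − α) accepted tokens on average at a cost of γc + 1 target-model steps:

```python
def expected_speedup(alpha: float, c: float, gamma: int) -> float:
    """Expected speedup of speculative decoding over standard decoding
    (standard analysis; illustrative, not Together's internal model).

    alpha: per-token draft acceptance rate, 0 <= alpha < 1
    c:     draft-to-target latency ratio (draft step / target step)
    gamma: tokens drafted per verification round
    """
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_round = gamma * c + 1  # gamma draft steps + 1 verify step
    return expected_tokens / cost_per_round

# Raising alpha (better-aligned drafts) and lowering c (leaner, faster
# speculators) compound, which is why ATLAS targets both:
print(f"{expected_speedup(alpha=0.7, c=0.20, gamma=4):.2f}x")  # ~1.54x
print(f"{expected_speedup(alpha=0.9, c=0.05, gamma=4):.2f}x")  # ~3.41x
```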