MoonshotAI Kimi-Linear (github.com)

🤖 AI Summary
MoonshotAI today announced Kimi Linear, a hybrid linear-attention architecture centered on a new Kimi Delta Attention (KDA) kernel — a refinement of the gated delta rule with fine-grained, per-channel gating — and released the KDA kernel in FLA plus two 48B checkpoints (3B activated params, 1M context) trained on ~5.7T tokens. The hybrid design interleaves KDA and global MLA blocks at a 3:1 ratio; since only one block in four maintains a growing KV cache, this cuts KV-cache needs by up to ~75% while retaining or exceeding full-attention quality. Empirically, Kimi Linear matches full-attention speed on short-context MMLU-Pro (4k) while scoring 51.0, is Pareto-optimal on long-context RULER (128k) with 84.3 and a 3.98× speedup, and at extreme lengths (1M tokens) delivers up to ~6× faster decoding and 6.3× lower TPOT (time per output token) versus MLA. For the AI/ML community this matters because it offers a practical path to million-token contexts and RL-style scaling without the memory and latency penalties of dense attention. KDA maintains a fixed-size RNN memory state with efficient gating, providing hardware-friendly linear attention that preserves expressivity and reduces per-token compute and memory overhead at deployment. MoonshotAI provides Hugging Face checkpoints, example usage (Transformers/vLLM), and the FLA kernel, making it straightforward to experiment with long-context LLMs that need high throughput and a small KV-cache footprint.
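To make the gated-delta-rule idea concrete, here is a minimal, illustrative per-token recurrence in PyTorch. This is a naive reference loop, not the optimized FLA kernel, and the exact KDA update (gate placement, normalization) may differ; the per-channel `alpha` gate stands in for the fine-grained gating described above.

```python
import torch

def kda_style_recurrence(q, k, v, beta, alpha):
    """Naive reference loop for a fine-grained gated delta rule.

    Illustrative only -- not the FLA kernel; the exact KDA update
    may differ in gate placement and normalization.

    q, k:  (T, d_k) queries/keys (keys assumed L2-normalized)
    v:     (T, d_v) values
    beta:  (T,)     per-token write strength in (0, 1)
    alpha: (T, d_k) per-channel forget gates in (0, 1); a scalar per
                    token in the plain gated delta rule
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)          # fixed-size RNN state: the whole "KV cache"
    out = torch.empty(T, d_v)
    for t in range(T):
        S = alpha[t].unsqueeze(1) * S  # channel-wise forget (fine-grained gate)
        pred = k[t] @ S                # value the state currently assigns to k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - pred)  # delta-rule correction
        out[t] = q[t] @ S              # read out with the query
    return out

# Tiny usage example
T, d_k, d_v = 16, 8, 8
q = torch.randn(T, d_k)
k = torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1)
v = torch.randn(T, d_v)
out = kda_style_recurrence(q, k, v, torch.full((T,), 0.5), torch.full((T, d_k), 0.99))
```

The key property is that `S` stays a fixed (d_k × d_v) matrix regardless of sequence length, which is why the per-token memory cost is constant rather than growing like a KV cache.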
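The ~75% figure follows directly from the layer schedule: only the MLA blocks keep a sequence-length-proportional KV cache, and they are one in four. A toy sketch (the schedule function is hypothetical, not MoonshotAI's code):

```python
# Hypothetical illustration of the 3:1 hybrid schedule: three KDA blocks
# (constant-size state) for every global MLA block (growing KV cache).
def layer_schedule(n_layers: int, kda_per_mla: int = 3) -> list[str]:
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(n_layers)]

layers = layer_schedule(8)
print(layers)  # ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']

# Only MLA layers contribute a per-token KV cache, so the cache is ~1/4
# of a full-attention stack's: a ~75% reduction.
mla_fraction = layers.count("MLA") / len(layers)
print(f"KV cache vs. full attention: {mla_fraction:.0%} (~{1 - mla_fraction:.0%} saved)")
```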
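For experimentation, a minimal Transformers loading sketch follows. The checkpoint id is an assumption — check MoonshotAI's Hugging Face page for the exact names — and the custom architecture requires `trust_remote_code`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model id is illustrative; verify the exact checkpoint name on the
# MoonshotAI Hugging Face page (48B total / 3B activated variants).
model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # custom KDA/MLA architecture ships with the repo
)

prompt = "Summarize the gated delta rule in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

vLLM serving is also advertised; the same checkpoint id would be passed to vLLM's model argument.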