DeepSeek releases ‘sparse attention’ model that cuts API costs in half (techcrunch.com)

🤖 AI Summary
DeepSeek on Monday published an experimental model, V3.2-exp, that implements a new “DeepSeek Sparse Attention” mechanism to slash inference costs for long-context tasks. The architecture uses a two-stage selection pipeline: a “lightning indexer” first prioritizes relevant excerpts from the full context window, then a “fine-grained token selection” module picks specific tokens within those excerpts to feed a constrained attention window. By loading only salient tokens into the expensive attention computation, the model keeps server loads low while still operating over long documents. DeepSeek released the model weights on Hugging Face and a linked paper on GitHub for independent evaluation.

The practical upshot is potentially large: DeepSeek’s preliminary tests report API call prices cut by up to ~50% in long-context scenarios. Because the weights are open, third-party benchmarks can quickly verify and stress-test the claim. Technically, this is another example of optimizing transformer inference (not training) by sparsifying attention rather than changing base architectures, and it may influence how providers handle long-context applications (chat, retrieval-augmented generation, document understanding). Coming from DeepSeek, already notable for its low-cost R1 work, the release is unlikely to be revolutionary by itself but offers actionable techniques that could materially reduce production inference costs across the industry.
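To make the two-stage idea concrete, here is a minimal PyTorch sketch of coarse-then-fine token selection before attention. It is an illustration of the general pattern the summary describes, not DeepSeek's actual DSA implementation; all function and parameter names (sparse_attention, block_size, top_blocks, top_tokens) are invented for this example.

```python
# Sketch of a two-stage sparse-attention selection pipeline for one decode step.
# Stage 1 plays the role of a cheap "indexer" over coarse blocks of the context;
# Stage 2 does fine-grained token selection inside the kept blocks.
import torch
import torch.nn.functional as F


def sparse_attention(q, k, v, block_size=64, top_blocks=4, top_tokens=128):
    """q: (d,) query for the current step; k, v: (seq_len, d) cached context."""
    seq_len, d = k.shape

    # Stage 1: score coarse blocks with a cheap proxy (mean key per block) and
    # keep only the highest-scoring blocks.
    n_blocks = (seq_len + block_size - 1) // block_size
    pad = n_blocks * block_size - seq_len
    k_padded = F.pad(k, (0, 0, 0, pad))
    block_keys = k_padded.view(n_blocks, block_size, d).mean(dim=1)  # (n_blocks, d)
    block_scores = block_keys @ q                                    # (n_blocks,)
    kept_blocks = torch.topk(block_scores, k=min(top_blocks, n_blocks)).indices

    # Collect token positions belonging to the kept blocks, dropping padding.
    token_idx = torch.cat([
        torch.arange(b * block_size, min((b + 1) * block_size, seq_len))
        for b in kept_blocks.tolist()
    ])

    # Stage 2: fine-grained token selection inside the kept blocks, so only the
    # top-scoring tokens enter the expensive attention computation.
    token_scores = k[token_idx] @ q
    kept = torch.topk(token_scores, k=min(top_tokens, token_idx.numel())).indices
    sel = token_idx[kept]

    # Dense attention restricted to the selected tokens.
    attn = torch.softmax((k[sel] @ q) / d ** 0.5, dim=0)  # (n_selected,)
    return attn @ v[sel]


# Toy usage: a 4096-token context reduced to at most 128 attended tokens.
torch.manual_seed(0)
d_model = 32
keys = torch.randn(4096, d_model)
values = torch.randn(4096, d_model)
query = torch.randn(d_model)
out = sparse_attention(query, keys, values)
print(out.shape)  # torch.Size([32])
```

The cost saving in this pattern comes from the attention (and KV reads) scaling with the number of selected tokens rather than the full context length; the actual quality/cost trade-off of DeepSeek's indexer and selection modules is what the open weights let third parties verify.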