DeepSeek Sparse Attention: Boosting Long-Context Efficiency [pdf] (github.com)

🤖 AI Summary
DeepSeek introduces a practical sparse-attention architecture aimed at making very long-context transformers far more efficient in both training and inference. The paper (and accompanying code) proposes a hybrid attention pattern that combines local windowed attention, multi-scale dilated windows, and a learned routing/top‑k selection mechanism, so each query attends only to a small, content-relevant subset of keys. This design pushes the quadratic cost of vanilla attention toward near-linear scaling in sequence length while remaining compatible with standard transformer weights and GPU kernels.

Technically, DeepSeek emphasizes dynamic key chunking, hierarchical index-like routing, and concentration-aware normalization to preserve expressivity when most connections are pruned. The method is implemented with efficient memory mapping and custom kernels so it can be dropped into existing models without full retraining.

Reported empirical results show substantial speed and memory gains on long-document tasks (search/QA, code, retrieval-augmented generation) with minimal loss of downstream accuracy, making this a pragmatic option for developers who need longer contexts (thousands to potentially millions of tokens) without the cost of dense attention.
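
To make the per-query top‑k selection idea concrete, here is a minimal PyTorch sketch, not the repository's implementation: it scores every key for each query and keeps only the `top_k` highest-scoring ones before the softmax. The function name, the use of raw dot-product scores as the router, and the `top_k` value are illustrative assumptions; a real efficient kernel would avoid materializing the full score matrix, which is where the actual speed and memory savings come from.

```python
# Illustrative top-k sparse attention: each query attends only to its
# top_k highest-scoring keys. This sketch still builds the dense score
# matrix for clarity; an efficient implementation would not.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """q, k, v: [seq_len, d]. Returns [seq_len, d]."""
    seq_len, d = q.shape
    scores = q @ k.T / d**0.5                       # [seq_len, seq_len] routing scores
    k_eff = min(top_k, seq_len)
    topk_scores, topk_idx = scores.topk(k_eff, -1)  # keep k_eff keys per query
    weights = F.softmax(topk_scores, dim=-1)        # normalize over kept keys only
    selected_v = v[topk_idx]                        # [seq_len, k_eff, d]
    return (weights.unsqueeze(-1) * selected_v).sum(dim=1)

# Example: 1,024 tokens, 64-dim head, each query attends to 64 keys.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = topk_sparse_attention(q, k, v, top_k=64)
print(out.shape)  # torch.Size([1024, 64])
```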