🤖 AI Summary
DeepSeek released DeepSeek-V3.2-Exp, introducing DeepSeek Sparse Attention (DSA), a two-part design that speeds up transformer attention by cheaply identifying and then exploiting the most important token interactions. A tiny "Lightning Indexer" runs a lightweight attention-like pass to build a binary mask that keeps only the top-k query–key interactions per query (each mask row has exactly k entries). The indexer is still O(n²) asymptotically, but it uses far fewer heads and lower-dimensional queries and keys, so its constant factors are much smaller. A larger Multi-head Latent Attention (MLA) layer then computes block outputs by attending only to those k masked entries per query, giving an effective cost of O(k·n) for the heavy work.
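To make the two-stage flow concrete, here is a minimal PyTorch-style sketch of the idea. All names, shapes, and the dense-compute-then-mask shortcut are illustrative assumptions for exposition, not DeepSeek's actual kernels or APIs.

```python
import torch

def lightning_indexer_mask(q_small, k_small, k):
    """Cheap indexer pass (illustrative): low-dimensional queries/keys, few heads.
    Still an O(n^2) score computation, but with tiny constant factors.
    Returns a boolean mask with exactly k True entries per query row."""
    scores = q_small @ k_small.T                       # (n, n) cheap importance scores
    top_idx = scores.topk(k, dim=-1).indices           # top-k keys for each query
    mask = torch.zeros_like(scores, dtype=torch.bool)
    rows = torch.arange(scores.size(0)).unsqueeze(1)   # (n, 1) row indices
    mask[rows, top_idx] = True
    return mask

def masked_heavy_attention(q, k_mat, v, mask):
    """Heavy pass (illustrative stand-in for the MLA layer): attend only where the
    mask is True. Written as dense-then-mask for clarity; an efficient kernel would
    gather just the k selected keys per query, giving roughly O(k*n) useful work."""
    d = q.shape[-1]
    scores = (q @ k_mat.T) / d**0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: n = 256 tokens, full dim 64, indexer dim 16, keep k = 32 keys per query.
n, d_full, d_small, k = 256, 64, 16, 32
q, k_mat, v = (torch.randn(n, d_full) for _ in range(3))
q_s, k_s = torch.randn(n, d_small), torch.randn(n, d_small)
out = masked_heavy_attention(q, k_mat, v, lightning_indexer_mask(q_s, k_s, k))
```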
This matters because DSA attacks the real-world bottleneck in attention: instead of computing full dense attention everywhere, it first computes a cheap indicator of which score-matrix entries matter and spends the expensive compute only there. The idea is akin to KV-sharing tricks (YOCO, Multi-Query Attention) but distinct: it reuses an importance mask, not just KV pairs. Practically, DSA can improve throughput and memory efficiency for long-context models or large-capacity layers while preserving the key interactions; trade-offs include the indexer's residual quadratic pass, the choice of k, and tuning accuracy against sparsity. The approach is promising for hardware-friendly, scalable transformer designs that focus expensive compute only where it matters.
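For a rough sense of the arithmetic (illustrative numbers, not DeepSeek's actual configuration): with n = 131,072 tokens and k = 2,048 kept keys per query, the heavy pass touches k·n ≈ 2.7×10⁸ query–key pairs versus n² ≈ 1.7×10¹⁰ for dense attention, roughly a 64× reduction, while the indexer still pays its own scaled-down quadratic pass.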