Multi-Head Latent Attention (sebastianraschka.com)

🤖 AI Summary
Multi-Head Latent Attention (MLA) is a memory-saving attention variant used in DeepSeek V2/V3/R1 that compresses keys and values into a lower-dimensional latent space before writing them to the KV cache, then projects them back up at inference time. Unlike Grouped-Query Attention (GQA), which reduces the number of KV heads, MLA shrinks the stored KV representation itself by caching a latent vector instead of full keys and values. The technique pairs naturally with KV caching for long-context inference, and, according to ablations cited by the DeepSeek authors, it can even slightly outperform standard multi-head attention (MHA) in modeling quality, which helps explain its adoption over GQA.

Technically, KV-cache size scales roughly with batch_size × seqlen × n_layers × latent_dim for MLA versus batch_size × seqlen × n_layers × 2×embed_dim for MHA (which stores both K and V). That compression can yield large savings: for example, compressing embed_dim 2048 → latent_dim 1024 gave a ~75% KV cache reduction (≈4× smaller) in the provided estimator, since a single 1024-dimensional latent replaces K and V totaling 4096 values per token per layer.

Practical tradeoffs include one extra up-projection at inference (queries are also compressed, but only during training, not at inference), and the compression hyperparameter must be tuned: too small a latent_dim hurts modeling performance. Example runs show memory dropping from ~1.54 GB to ~0.68 GB with only a modest impact on throughput. Code, a memory estimator, and an MLA reference implementation (inspired by Hugging Face’s deepseek-mla) are provided for experimenting and integrating MLA as a drop-in replacement in GPT-style models.
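To make the scaling concrete, here is a minimal back-of-the-envelope estimator sketch in Python. The function name, layer count, sequence length, and fp16 element size are my assumptions for illustration; this is not the article's exact estimator, but it reproduces the 2048 → 1024 example above.

```python
# Hypothetical KV-cache size estimator (a sketch, not the article's code).
def kv_cache_gb(batch_size, seqlen, n_layers, dim_per_token, bytes_per_elem=2):
    """GB cached, given how many values are stored per token per layer."""
    n_elems = batch_size * seqlen * n_layers * dim_per_token
    return n_elems * bytes_per_elem / 1024**3

embed_dim, latent_dim = 2048, 1024
mha = kv_cache_gb(1, 8192, 32, 2 * embed_dim)   # MHA caches K and V: 2 * embed_dim per token
mla = kv_cache_gb(1, 8192, 32, latent_dim)      # MLA caches one latent vector per token
print(f"MHA: {mha:.2f} GB  MLA: {mla:.2f} GB  savings: {1 - mla/mha:.0%}")
# -> MHA: 2.00 GB  MLA: 0.50 GB  savings: 75%  (i.e., 4x smaller)
```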
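And a minimal PyTorch sketch of the latent-KV idea itself, assuming made-up names (LatentKVAttention, W_dkv, W_ukv) and omitting DeepSeek's decoupled RoPE path and query compression; it is meant only to show what gets cached, not to reproduce the reference implementation.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    def __init__(self, embed_dim, n_heads, latent_dim):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        # Down-project hidden states into a shared KV latent; this is what the KV cache stores.
        self.W_dkv = nn.Linear(embed_dim, latent_dim, bias=False)
        # Up-project the cached latent back to full-size K and V when attention is computed.
        self.W_ukv = nn.Linear(latent_dim, 2 * embed_dim, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x, kv_latent=None):
        b, t, d = x.shape
        new_latent = self.W_dkv(x)                                  # (b, t, latent_dim)
        latent = new_latent if kv_latent is None else torch.cat([kv_latent, new_latent], dim=1)
        k, v = self.W_ukv(latent).chunk(2, dim=-1)                  # reconstructed K and V
        q = self.W_q(x)

        def split(z):                                               # (b, s, d) -> (b, heads, s, head_dim)
            return z.view(b, z.shape[1], self.n_heads, self.head_dim).transpose(1, 2)

        out = nn.functional.scaled_dot_product_attention(
            split(q), split(k), split(v), is_causal=kv_latent is None
        )
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out), latent                           # latent is the updated KV cache
```

The only tensor carried across decoding steps is `latent`, so cached bytes per token grow with latent_dim rather than 2 × embed_dim, which is exactly the saving the estimator above counts.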