PowerRetention: a drop-in replacement for FlashAttention in LLMs (github.com)

🤖 AI Summary
Manifest AI released PowerRetention, an open-source PyTorch layer that implements "power retention" — a linear-cost attention variant designed as a drop-in replacement for FlashAttention in LLMs. The key claim is that retention decouples state size from context length and parameter count, letting models maintain a fixed-size state (controlled by a power parameter p, exposed as deg) instead of a growing KV cache. In benchmarks (a 3B model on an A100 with a 2048-token prefill), PowerRetention-based models like PowerCoder achieve far higher token throughput for long-context generation because the algorithm scales as O(t) in sequence length rather than the O(t²) of standard attention. Technically, the repo provides an efficient chunked algorithm, gated attention and rotary embedding support, CUDA kernels optimized for the A100, and FP16/BF16 precision. For inference there is power_retention_inference, which delivers constant-time per-token generation by updating a compact state (batch x num_heads x D x head_dim) and a normalization term sum_of_keys rather than extending a KV cache (sketched conceptually below). Usage is straightforward (pip install retention; Python 3.11/3.12, CUDA 12.4, Linux), with examples showing the chunk_size, deg, and switch_over_seq_len controls. The project includes training/inference examples, benchmarks, and developer tooling; it's Apache‑2.0 licensed and citable as "Symmetric Power Transformers" (Buckman et al., 2024).
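
The inference path described above amounts to a linear-attention-style recurrence, and a short conceptual sketch shows why per-token cost stays constant. The PyTorch snippet below is illustrative only: phi, step, and the deg=1 simplification are assumptions made for this sketch, not the retention package's API (the real power_retention_inference is backed by the repo's chunked algorithm and A100-optimized CUDA kernels).

```python
import torch

def phi(x, deg=1):
    # Placeholder nonnegative feature map so the normalizer stays positive.
    # The real library uses a symmetric power embedding of degree `deg`,
    # which expands head_dim into a larger feature dimension D.
    assert deg == 1, "sketch only covers the deg=1 case (D == head_dim)"
    return torch.nn.functional.softplus(x)

@torch.no_grad()
def step(state, sum_of_keys, q, k, v, deg=1):
    """One decoding step with a fixed-size state instead of a growing KV cache.

    state:        (batch, num_heads, D, head_dim)  running sum of phi(k) v^T
    sum_of_keys:  (batch, num_heads, D)            running sum of phi(k), for normalization
    q, k, v:      (batch, num_heads, head_dim)     current token's query/key/value
    """
    fk = phi(k, deg)                                    # (B, H, D)
    state = state + fk.unsqueeze(-1) * v.unsqueeze(-2)  # rank-1 update, O(D * head_dim)
    sum_of_keys = sum_of_keys + fk
    fq = phi(q, deg)                                    # (B, H, D)
    num = torch.einsum('bhd,bhde->bhe', fq, state)      # (B, H, head_dim)
    den = torch.einsum('bhd,bhd->bh', fq, sum_of_keys).unsqueeze(-1).clamp_min(1e-6)
    return num / den, state, sum_of_keys                # output plus updated state

# Toy shapes: batch=2, heads=4, head_dim=D=64 (deg=1).
B, H, Dh = 2, 4, 64
state = torch.zeros(B, H, Dh, Dh)
sum_of_keys = torch.zeros(B, H, Dh)
q, k, v = (torch.randn(B, H, Dh) for _ in range(3))
out, state, sum_of_keys = step(state, sum_of_keys, q, k, v)  # out: (B, H, Dh)
```

Each call to step does a fixed amount of work per token (roughly O(D × head_dim) per head), independent of how many tokens have already been generated, which is where the constant-time per-token claim comes from.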