🤖 AI Summary
Manifest AI open-sourced "power retention," a drop-in replacement for transformer attention that dramatically improves long-context efficiency without sacrificing scalability or GPU utilization. Replacing flash_attention(q,k,v) with power_retention(q,k,v) is claimed to yield >10× speedups in training and >100× in inference at 64k-token contexts (with larger gains at longer contexts). The library is pip-installable (pip install retention) and the repo is public. They also show that pretrained transformers need only small amounts of retraining to convert — e.g., StarCoder2-3B was retrained into PowerCoder-3B and released on Hugging Face — making adoption straightforward for existing models.
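The announcement only specifies the call signature power_retention(q, k, v) and the package name (pip install retention); the import path, tensor layout, and dtype below are assumptions, so this is just a minimal sketch of what the drop-in swap might look like:

```python
import torch

# Assumed import path; the announcement only gives `pip install retention`
# and the call signature power_retention(q, k, v).
from retention import power_retention

def attention_block(q, k, v):
    # q, k, v: (batch, seq_len, num_heads, head_dim) -- this layout is an
    # assumption; check the library docs for the expected shape.
    # Previously: out = flash_attention(q, k, v)
    out = power_retention(q, k, v)
    return out

# Toy usage at a 64k-token context, the length at which the claimed
# >10x training / >100x inference speedups are reported.
q = k = v = torch.randn(1, 65536, 8, 64, dtype=torch.bfloat16, device="cuda")
out = attention_block(q, k, v)
```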
Technically, the gains come from a hardware-aware CUDA implementation that keeps GPU utilization comparable to FlashAttention while changing the attention algorithm itself to be far more efficient for long sequences. The release also includes Vidrial, an open framework for writing clean, high-performance CUDA kernels, exposed through retention.experimental; their Vidrial implementation of FlashAttention2 is up to 20% faster than existing implementations. For the AI/ML community this lowers the compute and latency barriers to genuinely long-context models (document-level reasoning, lifelong memory, video/sequence modeling), enabling cheaper inference and easier retrofitting of existing models into long-context architectures.
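Since PowerCoder-3B is published on Hugging Face, trying the converted model should follow the usual transformers workflow. The repository id below is hypothetical (the announcement does not give the exact path), and trust_remote_code may or may not be required, so treat this as an assumed usage sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- substitute the actual PowerCoder-3B path on Hugging Face.
model_id = "manifest-ai/powercoder-3b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # assumption: a custom attention module may need remote code
    torch_dtype="auto",
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```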