Brumby-14B-Base: The Strongest Attention-Free Base Model (manifestai.com)

🤖 AI Summary
Researchers released Brumby-14B-Base, a 14B-parameter, attention-free LLM that replaces Transformer attention with a novel "power retention" layer and achieves performance competitive with state-of-the-art models. The model was trained for 60 hours on 32 H100 GPUs at a reported cost of $4,000 (vs. ~$200k typical for training a model of this scale from scratch), using a "retraining" approach that initializes the power-retention weights from a pretrained Transformer (Qwen3-14B-Base). After ~3k steps it matches Qwen3's training loss on the same data and shows comparable downstream evaluation behavior. The model and hardware-efficient kernels are available on Hugging Face and via pip install retention.

Technically, power retention is a true RNN: a retention layer with inputs Q, K, V ∈ R^{t×d}, gating g ∈ R^t, and a state S ∈ R^{d×D} updated as S_t = g_t S_{t-1} + V_t φ_p(K_t)^T, with output Y_t = S_t φ_p(Q_t), where φ_p is a tensor-power embedding and the degree p (they used p=2) controls the state dimensionality D. This formulation both admits an attention-equivalent form (useful for efficient implementation) and enables arbitrarily long-range influence with far lower memory and much faster long-context inference (a claimed speedup of hundreds of times).

Planned releases include long-context SFT tools (contexts up to 1,000,000 tokens), vLLM integration for faster, lower-memory inference, and a family of power-retention base models from 1B to >100B parameters.
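To make the recurrence concrete, here is a minimal NumPy sketch of the state update and readout as described above, not the released hardware-efficient kernels. It assumes the readout passes the query through φ_p as well, which is what makes the shapes consistent with S ∈ R^{d×D}.

```python
import numpy as np

def phi_p(x, p=2):
    """Tensor-power embedding: flattened p-fold outer product of x with itself.
    For p=2 and x in R^d this gives a vector in R^{d*d}, i.e. D = d**p."""
    out = x
    for _ in range(p - 1):
        out = np.outer(out, x).reshape(-1)
    return out

def power_retention(Q, K, V, g, p=2):
    """Recurrent (RNN) form of the power retention layer sketched above.

    Q, K, V: (t, d) arrays; g: (t,) gating values.
    State S has shape (d, D) with D = d**p and is updated as
        S_t = g_t * S_{t-1} + V_t phi_p(K_t)^T
    with readout
        Y_t = S_t phi_p(Q_t)   # query embedded too (assumption for shape consistency)
    """
    t, d = Q.shape
    D = d ** p
    S = np.zeros((d, D))
    Y = np.zeros((t, d))
    for i in range(t):
        S = g[i] * S + np.outer(V[i], phi_p(K[i], p))  # state update
        Y[i] = S @ phi_p(Q[i], p)                      # readout
    return Y
```

Because φ_p(q) · φ_p(k) = (q · k)^p, the un-gated recurrence reproduces a quadratic-time, attention-like form with scores (Q_i · K_j)^p over j ≤ i; that identity is the attention-equivalent view the summary refers to.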
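For trying the released checkpoint, a hypothetical usage sketch under standard Hugging Face conventions follows. The repository id, the need for trust_remote_code, and the dtype/device settings are assumptions rather than details confirmed by the summary; pip install retention provides the kernels package named above.

```python
# pip install retention transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: the repo id below is assumed for illustration; check the model's
# Hugging Face page for the exact name. trust_remote_code is assumed because
# the architecture replaces attention with power retention.
REPO = "manifestai/Brumby-14B-Base"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    torch_dtype=torch.bfloat16,   # 14B parameters: roughly 28 GB of weights in bf16
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Power retention replaces attention by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```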