Softmax-free ~354M: tile-skip kernels for long-context VRAM savings (sparse) (huggingface.co)

0 points 1 hour ago | visit original

🤖 AI Summary

A team has announced RRT-355M, a model of roughly 354 million parameters inspired by GPT-2 Medium that uses a softmax-free attention mechanism to reduce VRAM use. On a standard 22-task in-context learning benchmark it achieves a CORE score of 0.1558, below the dense GPT-2 Medium's 0.1770 but broadly comparable to its softmax-based counterparts. The significance of the work lies in its potential to cut memory usage during inference through structural sparsity, with reported savings of up to 55% at longer context lengths. The accompanying Hugging Face repository provides the weights and configurations, although inference requires the custom RRT engine. The model was trained on 11.534 billion tokens, and its sparsity lets it produce outputs nearly identical to those of dense models while using significantly less VRAM. RRT-355M shows gains on multiple-choice reasoning tasks but minor regressions on continuation tasks, a trade-off that AI/ML researchers should weigh before adopting such models. The team has stated that no further checkpoints will be released, underlining the model's proof-of-concept status.
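To put the quoted figures in proportion, here is a small back-of-the-envelope calculation. It uses only the numbers given in the summary (the two CORE scores, the training token count, and the approximate parameter count from the title); the variable names are illustrative and nothing here comes from the RRT codebase.

```python
# Figures quoted in the summary; the parameter count is approximate ("~354M").
core_rrt = 0.1558           # CORE score reported for RRT-355M
core_gpt2_medium = 0.1770   # CORE score reported for dense GPT-2 Medium
train_tokens = 11.534e9     # training tokens
params = 354e6              # approximate parameter count

# Absolute and relative gap between the two CORE scores.
abs_gap = core_gpt2_medium - core_rrt
rel_gap = abs_gap / core_gpt2_medium

# Training tokens seen per model parameter.
tokens_per_param = train_tokens / params

print(f"absolute CORE gap: {abs_gap:.4f}")
print(f"relative CORE gap: {rel_gap:.1%}")
print(f"training tokens per parameter: {tokens_per_param:.1f}")
```

On these numbers, the score gap is about 0.021 absolute, or roughly 12% relative to the dense baseline, and the training budget works out to about 33 tokens per parameter.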
