Mellum2 Technical Report (arxiv.org)

0 points | 2 hours ago

🤖 AI Summary

The Mellum 2 Technical Report introduces a 12B-parameter Mixture-of-Experts (MoE) language model built for software engineering. Only 2.5B parameters are active per token. The model targets code generation, debugging, multi-step reasoning, and interactive programming assistance. It succeeds the 4B dense Mellum model and combines several techniques, including Grouped-Query Attention and a Multi-Token Prediction head. It was pre-trained on approximately 10.6 trillion tokens and then refined with additional training methods aimed at improving efficiency.

According to the report, Mellum 2 is competitive with other models in the 4B-14B range while operating at the per-token compute level of a 2.5B dense model. With a 128K context window and YaRN (a method for extending the context length of RoPE-based models), it supports long-context reasoning and tool use. Checkpoints are released under the Apache 2.0 license, together with documentation of the architecture, training procedures, and improvements.
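To make the "12B total, 2.5B active" claim concrete, here is a minimal arithmetic sketch in Python. The parameter counts come from the summary above. The helper names and the "about 2 FLOPs per active parameter per token" rule of thumb are assumptions for illustration, not figures from the report.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used for any single token."""
    if total_params <= 0 or not (0 < active_params <= total_params):
        raise ValueError("expected 0 < active_params <= total_params")
    return active_params / total_params


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost per token.

    Assumption (not from the report): about 2 FLOPs per active parameter,
    a common rule of thumb that ignores attention over long contexts.
    """
    return 2.0 * active_params


# Figures quoted in the summary.
TOTAL_PARAMS = 12e9        # Mellum 2, all experts counted
ACTIVE_PARAMS = 2.5e9      # Mellum 2, parameters used per token
DENSE_PREDECESSOR = 4e9    # original dense Mellum

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
moe_flops = approx_forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = approx_forward_flops_per_token(DENSE_PREDECESSOR)

print(f"active fraction: {fraction:.1%}")
print(f"approx. forward FLOPs/token, Mellum 2: {moe_flops:.2e}")
print(f"approx. forward FLOPs/token, 4B dense: {dense_flops:.2e}")
print(f"ratio (Mellum 2 / 4B dense): {moe_flops / dense_flops:.3f}")
```

Under these assumptions, roughly a fifth of the weights are used for each token, so the estimated per-token compute falls below that of the 4B dense predecessor even though the total parameter count is three times larger. Memory footprint still scales with all 12B parameters, since every expert has to be loaded.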
