Arcee Trinity Large Technical Report (arxiv.org)

🤖 AI Summary
The Arcee Trinity Large Technical Report introduces Trinity Large, a sparse Mixture-of-Experts (MoE) model with 400 billion parameters, of which 13 billion are activated per token. The report also covers two smaller variants, Trinity Nano and Trinity Mini, with 6 billion and 26 billion parameters, respectively. The architecture combines interleaved local and global attention, gated attention, and a new MoE load-balancing strategy called Soft-clamped Momentum Expert Bias Updates (SMEBU). The models were trained with the Muon optimizer, and all three completed training without any loss spikes, a sign of stable training at this scale. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large on 17 trillion tokens. The model checkpoints are available, giving researchers and developers direct access to the models for natural language processing work and other applications.
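To put the Trinity Large figures in proportion, here is a minimal arithmetic sketch. It uses only the numbers quoted in the summary (400 billion parameters, 13 billion activated per token, 17 trillion training tokens). The variable and function names are illustrative and do not come from the report.

```python
# Figures quoted in the summary for Trinity Large.
# All names below are illustrative, not taken from the report.
TOTAL_PARAMS = 400e9    # total parameters
ACTIVE_PARAMS = 13e9    # parameters activated per token
TRAIN_TOKENS = 17e12    # pre-training tokens


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting a non-positive denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Share of the model's parameters used for any single token.
active_fraction = ratio(ACTIVE_PARAMS, TOTAL_PARAMS)

# Training tokens seen per parameter, counted against total parameters.
tokens_per_total_param = ratio(TRAIN_TOKENS, TOTAL_PARAMS)

print(f"Active fraction per token: {active_fraction:.2%}")
print(f"Tokens per total parameter: {tokens_per_total_param:.1f}")
```

By these figures, only a small share of the model's parameters is used for any single token. That is the point of a sparse MoE design: per-token compute scales with the activated parameters rather than the total.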