🤖 AI Summary
Researchers introduce Efficiency Leverage (EL), a new metric and unified scaling law for predicting the computational advantage of Mixture-of-Experts (MoE) language models over dense equivalents. To tackle the long-standing challenge of estimating MoE capacity from configuration choices such as expert activation and granularity, the team trained over 300 models (up to 28B parameters) and found that EL is governed primarily by the expert activation ratio (the fraction of experts used per token) and the total compute budget, both following predictable power laws. Expert granularity, i.e. the size and number of experts, acts as a nonlinear modulator with a clear optimal range rather than a simple linear effect.
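A minimal sketch of what such a power-law relationship could look like, assuming a hypothetical form EL ≈ k · A^(−α) · C^β in the activation ratio A and compute budget C. The functional form, data points, and fitted coefficients below are illustrative assumptions, not the paper's law, which also has to model granularity's nonlinear effect:

```python
import numpy as np

# Synthetic observations standing in for measured (activation ratio, compute, EL) points.
A  = np.array([0.03, 0.06, 0.12, 0.25, 0.50])   # activation ratio (active experts / total)
C  = np.array([1e20, 3e20, 1e21, 3e21, 1e22])   # training compute budget in FLOPs
EL = np.array([9.5, 7.2, 5.1, 3.4, 2.0])        # made-up efficiency-leverage values

# Fit log EL = log k - alpha * log A + beta * log C with ordinary least squares,
# i.e. a power law in both A and C.
X = np.column_stack([np.ones_like(A), -np.log(A), np.log(C)])
coef, *_ = np.linalg.lstsq(X, np.log(EL), rcond=None)
log_k, alpha, beta = coef
print(f"EL ≈ {np.exp(log_k):.3g} * A^(-{alpha:.2f}) * C^{beta:.3f}")
```

Fitting in log space keeps the regression linear and numerically stable even though the compute values span orders of magnitude.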
They validated the scaling law by building Ling-mini-beta (0.85B active parameters) and training it on the same 1T-token high-quality corpus as a 6.1B dense baseline; Ling-mini-beta matched the dense model's performance while using more than 7× less compute. Practically, this gives model designers a principled, empirically grounded way to choose activation ratios, granularity, and compute budgets to maximize efficiency, enabling much cheaper scaling of LLMs. The work promises more predictable MoE architecture tuning, lower training costs for comparable performance, and clearer trade-offs for deploying sparse-expert models at scale.
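As a rough sanity check on the >7× figure, the standard C ≈ 6·N·D training-FLOPs approximation (N = active parameters, D = training tokens) yields a similar ratio; this is a generic estimate, not necessarily the compute accounting used in the paper:

```python
# Back-of-the-envelope compute comparison using C ≈ 6 * N * D.
TOKENS            = 1e12      # 1T-token corpus, shared by both models
moe_active_params = 0.85e9    # Ling-mini-beta active parameters
dense_params      = 6.1e9     # dense baseline parameters

moe_flops   = 6 * moe_active_params * TOKENS
dense_flops = 6 * dense_params * TOKENS
print(f"dense / MoE training compute ≈ {dense_flops / moe_flops:.1f}x")  # ≈ 7.2x
```

The ≈7.2× ratio from this simple estimate is consistent with the reported >7× efficiency leverage at matched performance.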