MoE at Scale: Making Sparse Models Fast on Real Hardware (www.cerebras.ai)

🤖 AI Summary
The piece explains the practical bottlenecks that arise when scaling Mixture-of-Experts (MoE) models and presents two complementary solutions: hardware-level memory handling on the Cerebras Wafer Scale Engine (WSE) and a software technique called Batch Tiling on Attention (BTA). On GPUs, MoE scaling forces a choice between loading all experts into every device's memory or using expert parallelism (EP), which adds costly all-to-all communication, complex 3D parallelism, and a tension between load-balanced routing (good for hardware utilization) and specialized routing (good for model quality). Cerebras sidesteps most model parallelism with vastly more on-chip SRAM plus weight streaming from external memory to the wafer, enabling training of much larger MoE parameter counts (up to ~1B parameters on chip, with weight streaming to support trillion-parameter scales).

However, the WSE exposes a different problem: sparse routing starves compute, because each expert sees only a tiny per-expert batch while attention layers are bound by activation memory. BTA fixes this by tiling attention along the batch dimension: it processes G tiles of size B through attention, then concatenates them into one large G×B batch for the expert layers, decoupling the batch-size constraints of attention and the experts. On Qwen3 (3B active parameters, 128 experts, top_k=8), conventional batching lost up to 53% of throughput when scaling the number of experts and up to 86% when increasing sparsity; with BTA, throughput stayed close to the dense baseline.

Implications: BTA converts theoretical FLOP savings into wall-clock speedups on hardware like the WSE, reduces reliance on costly EP on GPUs, and helps maintain both utilization and model quality when scaling MoEs.
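The BTA mechanism is only described in prose above; the sketch below makes the data flow concrete. It is a minimal illustration under toy assumptions: `ToyMoELayer`, `bta_block`, the `nn.Identity` attention stand-in, and all shapes and tile counts are hypothetical, not Cerebras' actual implementation or API.

```python
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """Top-k routed experts: each expert only processes the tokens routed to
    it, so its effective batch shrinks as the number of experts grows."""

    def __init__(self, d_model: int, n_experts: int, top_k: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)           # (tokens, n_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)    # (tokens, top_k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out


def bta_block(attn: nn.Module, moe: ToyMoELayer, x: torch.Tensor, G: int) -> torch.Tensor:
    """Batch Tiling on Attention: run attention on G small tiles (sized to fit
    activation memory), then hand the experts one concatenated G*B batch."""
    tiles = x.chunk(G, dim=0)                              # G tiles, each (B, seq, d_model)
    attn_out = torch.cat([attn(t) for t in tiles], dim=0)  # attention at tile granularity
    tokens = attn_out.flatten(0, 1)                        # (G*B*seq, d_model) for routing
    return moe(tokens).view_as(attn_out)                   # experts see the full G*B batch


# Illustrative usage: 4 attention tiles of batch 8, one expert batch of 32.
x = torch.randn(32, 128, 64)   # (G*B sequences, seq_len, d_model); toy sizes
attn = nn.Identity()           # stand-in for a real attention layer
moe = ToyMoELayer(d_model=64, n_experts=8, top_k=2)
y = bta_block(attn, moe, x, G=4)
```

The point of the sketch is the decoupling: attention runs at tile size B (bounded by activation memory), while each expert is routed tokens drawn from the full G×B batch, keeping per-expert batches large enough to stay compute-bound.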