Intro to Routing: Mixture-of-Experts and Expert Choice (www.neelsomaniblog.com)

🤖 AI Summary
This piece derives Mixture-of-Experts (MoE) and Expert Choice (EC) routing from first principles, explaining why common engineering choices arise and what their trade-offs are. For MoE, the router computes logits z and softmax probabilities g over N experts, producing a convex combination of expert outputs; evaluating every expert, however, is expensive. Top-1 (or Top-K) gating runs only the highest-scoring experts, but then unused experts receive zero gradient, which can lead to training collapse in which a few experts dominate. To mitigate this, the author derives expected per-expert loads using the Gumbel-max trick so that imbalance can be penalized directly (e.g., with KL, L2, entropy, or coefficient-of-variation losses). In practice, a simpler surrogate auxiliary loss, built from empirical load approximations using soft probabilities, is widely used despite being less principled.

Expert Choice flips the paradigm: experts pick the M tokens they want to serve, guaranteeing a fixed budget per expert and more predictable latency and utilization. Each expert evaluates g_i(x) for all tokens and selects its top-M; gradients flow only for the tokens an expert actually processes, so there is no need to differentiate through the discrete Top-M operator. This reduces hot spots and overload but requires fallbacks for tokens selected by no expert.

The post also notes the Plackett–Luce complexity of Top-K sampling, Shazeer et al.'s alternative noise-and-renormalization approach, and harder routing variants such as Mixture-of-Depths, where layer locations are dynamic, complicating the notion of a fixed expert function.
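To make the token-choice side concrete, here is a minimal PyTorch sketch of Top-K gating with the kind of surrogate auxiliary loss the summary describes. The function name, the load-times-importance form of the penalty, and the `aux_weight` coefficient are illustrative assumptions (following Switch-Transformer-style implementations), not details taken from the post.

```python
import torch
import torch.nn.functional as F

def topk_moe_gate(x, w_router, k=1, aux_weight=0.01):
    """Token-choice Top-K gating with a surrogate load-balancing loss.

    x:        (num_tokens, d_model) token representations
    w_router: (d_model, num_experts) router weights
    Returns renormalized gate weights, chosen expert indices, and the aux loss.
    """
    logits = x @ w_router                       # router logits z, shape (T, N)
    probs = F.softmax(logits, dim=-1)           # soft probabilities g

    # Keep only the top-k experts per token and renormalize their gates.
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    gates = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    # Surrogate balance term: empirical fraction of assignments per expert
    # (hard counts, non-differentiable) times the mean soft probability per
    # expert (differentiable), summed over experts.
    num_experts = probs.shape[-1]
    flat_idx = topk_idx.flatten()
    load = torch.zeros(num_experts, dtype=probs.dtype, device=x.device)
    load.scatter_add_(0, flat_idx, torch.ones_like(flat_idx, dtype=probs.dtype))
    load = load / flat_idx.numel()              # fraction of slots per expert
    importance = probs.mean(dim=0)              # mean router probability per expert
    aux_loss = aux_weight * num_experts * (load * importance).sum()

    return gates, topk_idx, aux_loss
```

Scaling by `num_experts` keeps the penalty near 1 under a perfectly uniform assignment, which makes the coefficient easier to tune; that convention comes from common implementations rather than from the post.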
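A companion sketch of Expert Choice routing under the same assumptions: each expert (a column of the router scores) gathers the M tokens it scores highest, so its load is fixed by construction. The dispatch/combine loop and names are hypothetical, and the scoring follows the summary's g_i(x) rather than any particular implementation.

```python
import torch
import torch.nn.functional as F

def expert_choice_forward(x, w_router, experts, capacity_m):
    """Expert Choice routing: each expert selects its top-M tokens.

    x:          (num_tokens, d_model) token representations
    w_router:   (d_model, num_experts) router weights
    experts:    list of expert modules (callables mapping (M, d) -> (M, d))
    capacity_m: number of tokens each expert processes
    """
    logits = x @ w_router                        # (T, N)
    probs = F.softmax(logits, dim=-1)            # g_i(x) for every token/expert

    # Each expert (column) picks the M tokens it scores highest.
    gate_vals, token_idx = probs.topk(capacity_m, dim=0)    # both (M, N)

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        idx = token_idx[:, e]                    # tokens served by expert e
        expert_out = expert(x[idx])              # gradients flow only through these rows
        out = out.index_add(0, idx, gate_vals[:, e].unsqueeze(-1) * expert_out)
    # Tokens selected by no expert keep a zero update; a real system needs a
    # fallback (e.g., a residual path), as the post points out.
    return out
```

Note that the discrete Top-M selection itself never needs a gradient: only the gathered rows `x[idx]` and the gate values enter backpropagation, which is the point about not differentiating through the Top-M operator.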