🤖 AI Summary
Researchers used an automated discovery system (OpenEvolve, an ADRS-style framework) to evolve a new Expert Parallelism Load Balancer (EPLB) for Mixture-of-Experts (MoE) LLM inference that matches existing load-balance quality while cutting rebalancing latency dramatically. Starting from a greedy Python baseline (≈540 ms) and a fast internal reference (19.6 ms), the evolved policy reduces runtime to 3.7 ms, a ~5.3× speedup over the reference, while preserving the same load balance factor. The search ran ~300 iterations over five hours for under $10, using a PyTorch MoE simulator and workloads modeled on ShareGPT and GSM8K.
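For intuition, here is a minimal sketch of how such a simulator might score a candidate placement. The summary does not define the exact EPLB metric, so this assumes a simple balance factor of mean GPU load over max GPU load (1.0 = perfectly balanced); the function name and shapes are illustrative, not the project's actual code.

```python
import torch

def balance_factor(expert_loads: torch.Tensor, placement: torch.Tensor,
                   num_gpus: int) -> float:
    """expert_loads: (num_experts,) tokens routed to each expert.
    placement:    (num_experts,) GPU index holding each expert."""
    # Sum each expert's load onto its assigned GPU.
    gpu_loads = torch.zeros(num_gpus).scatter_add_(0, placement,
                                                   expert_loads.float())
    # 1.0 means every GPU carries equal load; lower means more skew.
    return (gpu_loads.mean() / gpu_loads.max()).item()

loads = torch.tensor([90, 10, 55, 45, 70, 30, 60, 40])
gpu_of = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(balance_factor(loads, gpu_of, num_gpus=4))  # 1.0: each GPU sums to 100
```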
Technically, the gains combine two classes of improvement: engineering (replacing Python for-loops with batched PyTorch tensor ops) and algorithmic (a “zigzag” or snake placement pattern that alternates heavy and light experts across GPU slots; see the sketch below). The evolved pipeline also settled on the intuitive replication rule of duplicating only overloaded experts. Practically, the result implies much faster, lower-cost online reconfiguration for MoE serving (better GPU utilization and throughput) and demonstrates that automated program synthesis can discover both micro-optimizations and novel heuristics quickly, opening the door to applying ADRS to other scheduling, partitioning, and systems-level ML problems.
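As an illustration of both ideas at once, here is a hedged, fully vectorized sketch of snake placement: sort experts by measured load, then deal them across GPUs in alternating direction each round. The function name, shapes, and the no-replication simplification are assumptions, not the evolved policy's actual code.

```python
import torch

def snake_placement(expert_loads: torch.Tensor, num_gpus: int) -> torch.Tensor:
    """Return (num_experts,) GPU index per expert; assumes num_experts
    is a multiple of num_gpus and each expert has exactly one replica."""
    order = torch.argsort(expert_loads, descending=True)  # heaviest expert first
    rounds = expert_loads.numel() // num_gpus
    # Each round assigns one expert per GPU; odd rounds run right-to-left.
    cols = torch.arange(num_gpus).repeat(rounds, 1)       # (rounds, num_gpus)
    cols[1::2] = cols[1::2].flip(dims=[1])
    placement = torch.empty_like(order)
    placement[order] = cols.reshape(-1)                   # i-th heaviest -> cols[i]
    return placement

loads = torch.tensor([90., 10., 55., 45., 70., 30., 60., 40.])
print(snake_placement(loads, num_gpus=4))  # every GPU ends up with load 100
```

Compared with plain round-robin over the sorted list, reversing every other round pairs each GPU's heaviest expert with a correspondingly light one, which is why the interleaving balances load so well. The replication rule the search converged on would sit on top of a placement like this, duplicating only the experts whose load remains excessive.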