🤖 AI Summary
LPLB is an early-stage research load balancer for Mixture-of-Experts (MoE) models that uses linear programming to dynamically redistribute token work across redundant experts. Building on EPLB's reordering, LPLB replicates heavy experts according to a static topology (Cube, Hypercube, Torus, or a custom r2o matrix), models the redundant expert links as edges with capacities (current token counts), and solves an LP per batch to reassign tokens optimally. Real-time workload statistics can come from the user, torch.distributed, or a DeepEP buffer; NVLink/NVSHMEM are used to accelerate synchronization and reduce allreduce overheads. The project includes an embedded single-SM Interior Point Method LP solver that leverages NVIDIA cuSolverDx and cuBLASDx and requires CUDA Toolkit >= 12.6.3.
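To make the per-batch LP concrete, here is a rough conceptual sketch, not LPLB's actual API: it uses `scipy.optimize.linprog` instead of the embedded single-SM IPM solver, and the function name `solve_rebalance`, the one-directional edge flows, and the data layout are illustrative assumptions. The idea it demonstrates is the one described above: move tokens along replication edges, within each edge's capacity, so that the peak per-expert load is minimized.

```python
# Conceptual sketch of the per-batch rebalancing LP (not LPLB's real interface).
# Each edge (src, dst) of the replication topology carries a flow of tokens moved
# from an expert to its replica; flows are bounded by edge capacities (current
# token counts) and chosen to minimize the maximum per-expert load.
import numpy as np
from scipy.optimize import linprog

def solve_rebalance(loads, edges, capacities):
    """loads: initial token count per physical expert.
    edges: list of (src, dst) pairs from an r2o-style topology.
    capacities: max tokens that may be moved along each edge."""
    n, m = len(loads), len(edges)
    # Variables: m edge flows plus one auxiliary variable t = peak load to minimize.
    c = np.zeros(m + 1)
    c[-1] = 1.0
    # For every expert e: load_e + inflow_e - outflow_e <= t
    A_ub = np.zeros((n, m + 1))
    b_ub = -np.asarray(loads, dtype=float)
    for k, (src, dst) in enumerate(edges):
        A_ub[src, k] -= 1.0   # flow leaving src lowers its load
        A_ub[dst, k] += 1.0   # flow entering dst raises its load
    A_ub[:, -1] = -1.0
    bounds = [(0.0, cap) for cap in capacities] + [(0.0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:-1], res.x[-1]   # edge flows, resulting peak load

# Toy example: two experts, one heavily loaded, a single replication edge.
flows, peak = solve_rebalance(loads=[90, 10], edges=[(0, 1)], capacities=[90])
print(flows, peak)   # moves ~40 tokens so both experts end near 50
```

In the actual system this solve happens on-GPU per batch via the embedded IPM solver; the sketch only illustrates the shape of the optimization problem.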
Technically, LPLB minimizes imbalance within an expert-parallel group subject to edge capacities, returning physical expert indices for each logical selection. Solver latency is ≈100 µs for intra-node problems (longer cross-node), so its overhead matters for small batches. Current limitations: it balances total token counts (not nonlinear GEMM timing), can underperform EPLB under extreme global imbalance, and is still being evaluated for real-world gains. DeepEP is recommended for practical use, and the r2o matrix lets teams experiment with alternative replication topologies.
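To illustrate what "returning physical expert indices for each logical selection" might look like, here is a hypothetical mapping step, assuming a logical-to-physical replica table and per-replica token quotas derived from the LP solution. The function name, greedy spill-over policy, and data layout are assumptions for illustration, not LPLB's real interface.

```python
# Hypothetical mapping from logical expert selections to physical expert indices.
# Assumes the LP has already produced a token quota per physical expert.
import torch

def map_logical_to_physical(logical_ids, replicas, quotas):
    """logical_ids: (num_tokens,) logical expert chosen per token.
    replicas: dict logical_id -> list of physical expert indices (first = original).
    quotas: dict physical_id -> max tokens assigned by the LP solution."""
    remaining = dict(quotas)
    physical_ids = torch.empty_like(logical_ids)
    for t, lid in enumerate(logical_ids.tolist()):
        # Prefer the first replica with spare quota; fall back to the original.
        chosen = replicas[lid][0]
        for pid in replicas[lid]:
            if remaining.get(pid, 0) > 0:
                chosen = pid
                break
        remaining[chosen] = remaining.get(chosen, 0) - 1
        physical_ids[t] = chosen
    return physical_ids

# Toy example: logical expert 0 is replicated as physical experts 0 and 2.
ids = torch.tensor([0, 0, 0, 1])
out = map_logical_to_physical(ids, {0: [0, 2], 1: [1]}, {0: 2, 1: 4, 2: 1})
print(out)   # tensor([0, 0, 2, 1]) -- the third token spills to the replica
```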