🤖 AI Summary
Mercury is a new compiler and loop-based IR (CommIR) that treats remote GPU memory as a first-class, schedulable layer of the memory hierarchy to automatically generate high-performance multi-GPU operators for LLMs. By abandoning the common “everything must fit in local HBM” assumption, Mercury exposes a much larger design space—enabling asynchronous, loop-shifted schedules that stagger access to shared inputs (e.g., KV caches) across devices, reduce local memory pressure, and unlock larger tiling and reuse opportunities. The result: automated operator implementations that match or exceed hand-tuned designs and reduce the heavy manual engineering currently required to optimize attention and large linear operators across diverse hardware and network topologies.
Technically, Mercury introduces CommIR with structured transformation primitives (parallelize, shift, shard, replicate) that express inter-GPU schedules and remote-memory access patterns and can be lowered to efficient collectives or P2P transfers. An auto-tuner explores CommIR candidates and synthesizes communication plans (e.g., ring-like passes) without bespoke kernels. Mercury sits as a middle layer between graph-level optimizers and intra-GPU tensor compilers, supporting single- and multi-node setups. Evaluations show consistent speedups over state-of-the-art hand-optimized libraries (avg ~1.56×; up to 1.62× vs. model-level 3D-parallelism) across operators, sequence lengths, and hardware, demonstrating strong practical gains from compiler-driven remote memory scheduling.
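The summary names the four primitives but not CommIR's semantics, so the following is only a toy sketch of the general pattern: per-loop scheduling decisions plus a placeholder cost model that a brute-force tuner ranks. Everything here (the `Loop`/`Schedule` classes, the cost weights) is an assumption for illustration, not the paper's design.

```python
# Hypothetical illustration of loop-level communication scheduling:
# each loop in an operator's nest gets one primitive, and a tuner
# enumerates assignments and picks the cheapest under a toy cost model.

from dataclasses import dataclass, field
from itertools import product

@dataclass(frozen=True)
class Loop:
    var: str      # loop variable, e.g. "kv_block"
    extent: int   # trip count

@dataclass
class Schedule:
    loops: tuple
    decisions: dict = field(default_factory=dict)  # loop var -> primitive

    def cost(self) -> float:
        # Placeholder heuristic, not Mercury's cost model: sharding keeps
        # data local, replication duplicates it, shifting staggers reuse.
        score = 0.0
        for loop in self.loops:
            d = self.decisions[loop.var]
            if d == "shard":
                score -= loop.extent
            elif d == "replicate":
                score += loop.extent
            elif d == "shift":
                score -= 0.5 * loop.extent
        return score

PRIMITIVES = ("parallelize", "shift", "shard", "replicate")

def enumerate_schedules(loops):
    """Brute-force tuner: try every primitive assignment per loop."""
    for combo in product(PRIMITIVES, repeat=len(loops)):
        yield Schedule(loops, dict(zip((l.var for l in loops), combo)))

if __name__ == "__main__":
    attention = (Loop("query_block", 8), Loop("kv_block", 64))
    best = min(enumerate_schedules(attention), key=Schedule.cost)
    print("best toy schedule:", best.decisions)
```

A real tuner would prune this exponential space and score candidates against hardware and topology, but the shape of the search (structured primitives over loops, ranked by a cost model) matches what the summary describes.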