Enabling Trillion-Parameter Models on AWS EFA (research.perplexity.ai)

🤖 AI Summary
Perplexity announced a new set of inter-node "expert-parallel" kernels that make Mixture-of-Experts (MoE) inference practical at trillion-parameter scale on AWS Elastic Fabric Adapter (EFA) and push state-of-the-art latencies on ConnectX‑7. The team replaced an NVSHMEM-based approach (which faltered on EFA and on proxy-based ConnectX‑7 stacks) with a hybrid CPU–GPU design: GPU kernels handle the model work while a host proxy thread posts single RDMA writes for grouped token batches. This lets dispatch and combine operations overlap computation with communication, supports micro-batching and shared experts, and achieves a single-write-per-peer pattern that dramatically reduces messaging overhead compared with many small transfers.

Key technical features include an initial exchange of per-expert token counts so senders can lay out contiguous writes, small reserved private buffers that avoid stalls while routing information propagates, reuse of a TransferEngine for KV-cache semantics, and a mix of CUDA unified memory and GDRCopy for low-latency polling and bulk transfers. Intra-node, NVLink is leveraged to offload a significant fraction of traffic.

The result: latencies that beat prior DeepEP results on ConnectX‑7 and, crucially, the first viable latencies on AWS EFA, opening up practical multi-node deployments of trillion-parameter MoE models on AWS instances whose limited HBM makes inter-node routing necessary.
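To make the single-write-per-peer idea concrete, here is a minimal host-side sketch of the dispatch layout step: per-expert token counts are turned into contiguous per-peer slabs so that each remote rank receives exactly one bulk RDMA write instead of one message per token. This is an illustrative assumption of how such a layout could be computed; `post_rdma_write`, `TokenRoute`, and the peer/expert mapping are placeholders, not Perplexity's actual kernels or TransferEngine API.

```cpp
// Sketch: turn router output into one contiguous RDMA write per peer.
// All names and the RDMA stub are illustrative placeholders.
#include <cstdio>
#include <vector>

struct TokenRoute {
    int token_id;
    int expert_id;  // global expert index chosen by the router
};

// Placeholder for the write a host proxy thread would post over EFA/RDMA.
void post_rdma_write(int peer, size_t remote_offset, const int* data, size_t count) {
    std::printf("peer %d: one RDMA write, %zu tokens at offset %zu\n",
                peer, count, remote_offset);
}

int main() {
    const int num_peers = 4;          // remote ranks hosting experts
    const int experts_per_peer = 2;
    // Example routing output for 8 tokens.
    std::vector<TokenRoute> routes = {
        {0, 1}, {1, 5}, {2, 0}, {3, 5}, {4, 2}, {5, 7}, {6, 1}, {7, 4}};

    // 1) Count tokens per destination peer. (In the article, per-expert counts
    //    are exchanged up front so receivers can reserve contiguous space.)
    std::vector<size_t> peer_count(num_peers, 0);
    for (const auto& r : routes) peer_count[r.expert_id / experts_per_peer]++;

    // 2) Prefix-sum the counts into contiguous send offsets, one slab per peer.
    std::vector<size_t> peer_offset(num_peers, 0);
    for (int p = 1; p < num_peers; ++p)
        peer_offset[p] = peer_offset[p - 1] + peer_count[p - 1];

    // 3) Pack token ids contiguously by destination peer.
    std::vector<int> send_buf(routes.size());
    std::vector<size_t> cursor = peer_offset;
    for (const auto& r : routes)
        send_buf[cursor[r.expert_id / experts_per_peer]++] = r.token_id;

    // 4) One bulk write per peer instead of many small per-token transfers.
    for (int p = 0; p < num_peers; ++p)
        if (peer_count[p] > 0)
            post_rdma_write(p, peer_offset[p], &send_buf[peer_offset[p]], peer_count[p]);
    return 0;
}
```

The point of the sketch is the ordering: counts first, offsets second, then a single contiguous write per peer, which is what keeps the message count per dispatch bounded by the number of peers rather than the number of tokens.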