Efficient and Lossless MoE Diffusion LLM Inference with I/O-Aware Expert Offload (tide-paper.vercel.app)

0 points 1 hour ago | visit original

🤖 AI Summary

Researchers from the University of Central Florida, Mobi.AI, and Rice University have introduced TIDE, an inference system for efficient, lossless execution of Mixture-of-Experts (MoE) diffusion large language models (dLLMs). dLLMs are gaining traction because they offer better hardware utilization and context handling than autoregressive models, but they remain hard to deploy on resource-limited devices. TIDE exploits the consistent patterns of expert activations during the diffusion process: an interval-based expert refresh strategy minimizes data transfer between CPU and GPU, and the refreshes are scheduled to maximize efficiency without adding any training overhead. In practice, TIDE achieves up to 1.5 times better throughput on specific models with no retraining. It reduces costly expert migration, keeps GPU utilization high, and reuses experts across nearby decoding steps, effectively offering a "free lunch" for dLLM inference. This makes complex MoE models more practical to deploy in constrained environments and opens the way to using dLLMs in real-time settings.
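To make the interval-based refresh idea concrete, here is a minimal toy sketch in Python. It is not TIDE's implementation: the function names, the eviction rule, and the on-demand fetch between refreshes are all assumptions made for illustration. It only counts CPU-to-GPU expert transfers for a made-up activation trace, comparing a baseline that loads every needed expert at every step against a policy that rebuilds the GPU-resident expert set every `interval` steps.

```python
from typing import List, Set


def naive_transfers(activations: List[Set[int]]) -> int:
    """Baseline (assumed): nothing stays on the GPU, so every step
    loads all of the experts it activates."""
    return sum(len(needed) for needed in activations)


def interval_transfers(activations: List[Set[int]], interval: int) -> int:
    """Hypothetical interval-based policy: every `interval` steps the
    resident set is rebuilt around the current step's experts; between
    refreshes, resident experts are reused and misses are fetched on demand.
    Every needed expert is always made available, so the model output
    would be unchanged (only the transfer count differs)."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    resident: Set[int] = set()
    transfers = 0
    for step, needed in enumerate(activations):
        missing = needed - resident
        transfers += len(missing)  # only experts not already on the GPU move
        if step % interval == 0:
            resident = set(needed)  # refresh: evict experts not needed now
        else:
            resident |= missing  # keep on-demand fetches until next refresh
    return transfers


if __name__ == "__main__":
    # Made-up trace: expert IDs activated at each diffusion step.
    trace = [{0, 1}, {0, 1}, {0, 2}, {0, 2}, {1, 2}, {1, 2}]
    print("naive:", naive_transfers(trace))
    print("interval=2:", interval_transfers(trace, interval=2))
```

The savings in this toy come entirely from nearby steps activating overlapping experts, which is the property the summary attributes to the diffusion process; a trace with no overlap would show no benefit.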
