Tessera: Unlocking Heterogeneous GPUs Through Kernel-Granularity Disaggregation (arxiv.org)

🤖 AI Summary
Tessera is a new system for improving the efficiency of heterogeneous GPU clusters serving AI workloads. Existing disaggregation approaches operate at coarse granularity and are tied to specific model architectures; Tessera instead disaggregates at the level of individual computation kernels, matching the diverse resource demands of the kernels within a single application to the varied capabilities of heterogeneous GPUs. It combines offline analysis with online adaptation to overlap communication and computation effectively, and uses workload-aware scheduling for lightweight runtime adjustments.

For the AI/ML community, the significance lies in the measured gains: in evaluations, Tessera improved serving throughput by up to 2.3x and cost efficiency by up to 1.6x over prior methods, while supporting a broader range of model architectures than earlier disaggregation systems. Notably, a combination of heterogeneous GPUs running Tessera can exceed the throughput of two high-end homogeneous GPUs, offering a more cost-effective option for large-model inference and pointing toward more efficient use of GPU resources in model deployment.
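The core idea of matching per-kernel resource profiles to heterogeneous GPU capabilities can be illustrated with a toy scheduler. This is a minimal sketch under assumed names and numbers, not Tessera's actual algorithm: the GPU specs, kernel profiles, and roofline-style cost model here are all illustrative assumptions.

```python
# Illustrative sketch only: a toy kernel-to-GPU matcher, NOT Tessera's
# actual scheduler. All specs and profiles below are hypothetical.
from dataclasses import dataclass

@dataclass
class GPU:
    name: str
    tflops: float       # peak compute throughput (TFLOP/s)
    mem_bw: float       # memory bandwidth (TB/s)

@dataclass
class Kernel:
    name: str
    flops: float        # total work (TFLOP)
    bytes_moved: float  # total memory traffic (TB)

def est_time(k: Kernel, g: GPU) -> float:
    # Roofline-style estimate: a kernel is limited by whichever of
    # compute or memory traffic takes longer on this GPU.
    return max(k.flops / g.tflops, k.bytes_moved / g.mem_bw)

def assign(kernels, gpus):
    # Greedy: place each kernel on the GPU where its estimated time is
    # lowest (ignores load balancing and inter-GPU transfer costs).
    return {k.name: min(gpus, key=lambda g: est_time(k, g)).name
            for k in kernels}

gpus = [GPU("compute-heavy", tflops=300, mem_bw=1.0),
        GPU("bandwidth-heavy", tflops=100, mem_bw=3.0)]
kernels = [Kernel("matmul", flops=10, bytes_moved=0.01),           # compute-bound
           Kernel("attention-decode", flops=0.5, bytes_moved=0.3)]  # memory-bound

print(assign(kernels, gpus))
# The compute-bound matmul lands on the compute-heavy GPU and the
# memory-bound decode kernel on the bandwidth-heavy one.
```

Even this crude model shows why kernel granularity helps: the two kernels within one application prefer different GPUs, a mismatch that coarse, whole-model placement cannot exploit.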