Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs (arxiv.org)

🤖 AI Summary
NVIDIA has introduced CUDA Tile (CuTile), a Python-based abstraction designed to streamline GPU kernel development while retaining high efficiency through Tensor Cores and the Tensor Memory Accelerator (TMA). An independent evaluation of CuTile on NVIDIA's Hopper and Blackwell architectures (H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition) benchmarks AI workloads such as GEMM and LLM inference, finding that CuTile's effectiveness varies significantly by workload and architecture. Remarkably, CuTile achieved up to 1007 TFLOP/s for fused multi-head attention on the B200, 2.5x the throughput of FlashAttention-2, in only 60 lines of code, suggesting real potential for simplifying kernel development. Its GEMM performance, however, reached only 52-79% of cuBLAS, albeit in just 22 lines of code, still short of vendor-optimized libraries. The evaluation also exposes notable optimization gaps across architectures: the same CuTile kernel hit only 53% of FlashAttention-2 throughput on the RTX PRO 6000. By contrast, Triton sustained 62-101% of cuBLAS performance across platforms without architecture-specific tuning, demonstrating superior portability, a crucial consideration for developers targeting diverse GPU environments.
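The tile-level programming model behind abstractions like CuTile and Triton can be illustrated with a CPU-side NumPy analogue. This sketch is not CuTile's actual API (its Python interface is not shown in the summary); it only mirrors the decomposition such frameworks express per thread block: each output tile is computed independently by accumulating over tiles of the shared K dimension, while the real framework maps the tile arithmetic onto Tensor Cores and uses TMA for the tile loads.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Tile-wise GEMM on the CPU: each (i, j) output tile is accumulated
    over K-tiles. In a GPU tile framework, each (i, j) iteration would be
    one independently scheduled program/thread block, and the inner
    tile-multiply would map to Tensor Core instructions."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):          # one "program" per output tile row
        for j in range(0, N, tile):      # ... and per output tile column
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, K, tile):  # accumulate over K in tiles
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(8, 8).astype(np.float32)
B = np.random.rand(8, 8).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

Because each output tile depends only on a strip of A and a strip of B, the tiles can be scheduled in any order, which is what lets a compiler specialize the same tile-level kernel for different memory hierarchies on Hopper versus Blackwell.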