🤖 AI Summary
AMD Research released Iris, an experimental Triton-based framework that brings SHMEM-like Remote Memory Access (RMA) primitives to pure Python/Triton for multi-GPU programming. Iris exposes simple, Triton-native device APIs (e.g., iris.store) and a symmetric heap abstraction so developers can perform direct remote reads/writes between GPUs from within Triton kernels. The goal is to make multi-GPU code as easy to write and as performant as single-GPU Triton kernels, enabling finer-grained communication/computation overlap and patterns like peer-to-peer GEMM scaling without leaving the Triton ecosystem. This is a research preview (not a product) and is MIT-licensed and open for contributions.
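To make the device-side idea concrete, here is a minimal sketch of what a Triton kernel using iris.store could look like, modeled on the two-rank example the summary describes. The kernel signature and the exact argument order of iris.store (pointer, value, source/target ranks, heap base pointers, mask) are assumptions for illustration, not copied verbatim from the Iris repo.

```python
import triton
import triton.language as tl
import iris  # AMD Research's Triton-based RMA framework

# Sketch: each program block loads a tile from the local buffer and writes
# it into the same offsets of the target rank's symmetric heap.
# NOTE: the iris.store signature below is an assumption, not confirmed API.
@triton.jit
def copy_to_remote(buffer, n_elements, source_rank, target_rank, heap_bases,
                   BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    # Ordinary Triton load from this rank's local view of the buffer.
    data = tl.load(buffer + offsets, mask=mask)

    # RMA write: the symmetric-heap base pointers let the call translate the
    # local pointer into target_rank's address space and store remotely.
    iris.store(buffer + offsets, data, source_rank, target_rank,
               heap_bases, mask=mask)
```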
Technically, Iris provides a runtime (iris.iris) that allocates a symmetric heap across ranks, device-side RMA primitives that can be called inside triton.jit kernels, rank-aware barrier/synchronization, and convenience allocation (iris_ctx.zeros). The repo includes examples and benchmarks; a simple two-rank example shows a Triton kernel invoking iris.store to write from source_rank to target_rank using heap base pointers. Requirements: Python 3.10+, PyTorch 2.0+ (ROCm), ROCm 6.3.1+ HIP runtime, and Triton; currently validated on MI300X/MI350X/MI355X. Planned work includes broader AMD GPU testing, RDMA-based multi-node support, and richer end-to-end examples. Install via pip from the Git repository or use the provided Docker Compose development environment.
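On the host side, the workflow described above (initialize the runtime, allocate from the symmetric heap, launch the kernel, synchronize ranks) might look roughly like the sketch below. It assumes the kernel from the previous snippet; the method names cur_rank, get_heap_bases, and barrier, and the iris.iris constructor argument, are hypothetical stand-ins for whatever the real context object exposes.

```python
import torch
import triton
import iris

# Rough per-rank host sketch; exact accessor names are hypothetical.
heap_size = 2**30                      # e.g. a 1 GiB symmetric heap per rank
ctx = iris.iris(heap_size)

source_rank, target_rank = 0, 1
n_elements = 1 << 20
BLOCK_SIZE = 1024

# Convenience allocation from the symmetric heap (iris_ctx.zeros in the summary).
buffer = ctx.zeros(n_elements, dtype=torch.float32)
if ctx.cur_rank == source_rank:        # hypothetical rank accessor
    buffer.fill_(1.0)

grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
copy_to_remote[grid](buffer, n_elements, source_rank, target_rank,
                     ctx.get_heap_bases(),   # hypothetical heap-base accessor
                     BLOCK_SIZE=BLOCK_SIZE)

ctx.barrier()                          # rank-aware synchronization
```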