ThunderMittens for Your ThunderKittens (hazyresearch.stanford.edu)

🤖 AI Summary
ThunderMittens is a Metal (Apple GPU) port of ThunderKittens (TK), a small DSL and kernel collection originally tuned for NVIDIA datacenter GPUs. The team adapted TK to an Apple M2 Pro to show that the same tile-based abstractions carry over to on-device training and inference, unlocking better privacy, user-personalized models, and wider experimentation on consumer hardware. The announcement emphasizes portability: only one major abstraction change was needed, shrinking the 16x16 base tile to 8x8 to fit Metal's per-thread register limits and the metal::simdgroup_matrix<T, 8, 8> type.

Technically, the port reflects the M2 Pro's hardware profile (~200 GB/s memory bandwidth, ~6.5 TFLOPS), where bandwidth is plentiful relative to compute compared with an RTX 4090 (~1,000 GB/s, ~82.6 TFLOPS). The consequences: shared memory and complex swizzling buy less, simple kernels often suffice, bf16 support is spotty (necessitating manual unrolling hacks), and occupancy is critical because register pressure can kill performance. NVIDIA-specific TMA/WGMMA and async loads were removed, and register layouts were adjusted for metal::simdgroup_multiply_accumulate.

Results are competitive: attention inference lands within ±15% of MLX's implementation, GEMM is ~9% faster at many sizes, and the kernel code is far more concise (11 lines vs. 100+). The repo is early-stage, with more kernels and M3/M4 tuning planned, and the project invites open-source contributions.
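To make the 8x8 tile change concrete, here is a minimal sketch (not ThunderMittens' actual API, and the kernel name and buffer layout are invented for illustration) of the Metal simdgroup intrinsics the port builds on: each 8x8 fragment is spread across a simdgroup, analogous to TK's register tiles on NVIDIA hardware.

```cpp
#include <metal_stdlib>
using namespace metal;

// Sketch: multiply one 8x8 tile of A by one 8x8 tile of B and accumulate into C.
// Assumes a dispatch of a single simdgroup; lda/ldb/ldc are row strides in elements.
kernel void tile_matmul_8x8(device const float *A [[buffer(0)]],
                            device const float *B [[buffer(1)]],
                            device float       *C [[buffer(2)]],
                            constant uint      &lda [[buffer(3)]],
                            constant uint      &ldb [[buffer(4)]],
                            constant uint      &ldc [[buffer(5)]])
{
    // 8x8 register fragments held collectively by the simdgroup.
    simdgroup_float8x8 a, b;
    simdgroup_float8x8 c = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);

    simdgroup_load(a, A, lda);                   // load an 8x8 tile of A
    simdgroup_load(b, B, ldb);                   // load an 8x8 tile of B
    simdgroup_multiply_accumulate(c, a, b, c);   // c += a * b
    simdgroup_store(c, C, ldc);                  // write the 8x8 result tile
}
```

A full GEMM would loop these loads and accumulates over the K dimension per output tile; the point here is only that the 8x8 shape is fixed by the hardware intrinsic, which is why TK's 16x16 base tile had to shrink.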