🤖 AI Summary
A developer implemented a high-performance matrix multiply for Apple GPUs (a 2022 MacBook Air) in Metal and reached roughly 2.5 TFLOPS of 32-bit floating-point throughput, on par with Apple's closed-source Metal Performance Shaders (MPS). The implementation follows the standard tiled-decomposition pattern (tiles → sub-tiles → tensor-core multiplies): each threadgroup, and each simdgroup within it, owns a tile of the output product, with threadgroup memory used to reuse the shared rows and columns of the inputs. Critically, the kernel relies on an undocumented asynchronous device→threadgroup copy instruction, simdgroup_async_copy (plus a matching __metal_wait_simdgroup_events), exposed via a low-level signature and a small template wrapper that copies 2D tiles into threadgroup memory for the tensor cores to consume.
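As a rough illustration of that tiling pattern (not the post's actual kernel), a minimal Metal sketch might look like the following. The tile size, the assumption of square matrices with N a multiple of TILE, and the row-major buffer layout are all illustrative choices, not the post's parameters; only the simdgroup-matrix calls (`simdgroup_load`, `simdgroup_multiply_accumulate`, `simdgroup_store`) are Apple's documented API:

```metal
#include <metal_stdlib>
using namespace metal;

constexpr constant uint TILE = 32;  // illustrative tile edge, a multiple of 8

// Sketch of the tiled pattern: C = A * B for square matrices whose
// dimension N is assumed to be a multiple of TILE (no edge handling).
// Dispatch with N/TILE x N/TILE threadgroups of (TILE/8)^2 simdgroups.
kernel void sgemm_tiled(device const float *A [[buffer(0)]],
                        device const float *B [[buffer(1)]],
                        device float       *C [[buffer(2)]],
                        constant uint      &N [[buffer(3)]],
                        uint2 tg  [[threadgroup_position_in_grid]],
                        uint  sg  [[simdgroup_index_in_threadgroup]],
                        uint  tid [[thread_index_in_threadgroup]])
{
    // Staging tiles: every simdgroup in the block reuses the same rows of A
    // and columns of B, which is where the bandwidth savings come from.
    threadgroup float tileA[TILE][TILE];
    threadgroup float tileB[TILE][TILE];

    // Each simdgroup accumulates one 8x8 sub-tile of the block's product.
    const uint sgRow = (sg / (TILE / 8)) * 8;
    const uint sgCol = (sg % (TILE / 8)) * 8;
    const uint nthreads = (TILE / 8) * (TILE / 8) * 32;  // 32-wide simdgroups

    simdgroup_float8x8 acc = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);

    for (uint k = 0; k < N; k += TILE) {
        // Plain cooperative device->threadgroup copy; this is the loop the
        // post replaces with the undocumented simdgroup_async_copy.
        for (uint i = tid; i < TILE * TILE; i += nthreads) {
            uint r = i / TILE, c = i % TILE;
            tileA[r][c] = A[(tg.y * TILE + r) * N + (k + c)];
            tileB[r][c] = B[(k + r) * N + (tg.x * TILE + c)];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // Feed 8x8 fragments through the simdgroup matrix hardware.
        for (uint kk = 0; kk < TILE; kk += 8) {
            simdgroup_float8x8 a, b;
            simdgroup_load(a, &tileA[sgRow][kk], TILE);
            simdgroup_load(b, &tileB[kk][sgCol], TILE);
            simdgroup_multiply_accumulate(acc, a, b, acc);
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Write the finished 8x8 sub-tile of C back to device memory.
    simdgroup_store(acc, C + (tg.y * TILE + sgRow) * N + (tg.x * TILE + sgCol), N);
}
```

The cooperative copy loop in this sketch is exactly the part the post swaps out for simdgroup_async_copy; everything else uses the standard, documented simdgroup-matrix API.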
The post has two practical takeaways for AI/ML engineers. First, Apple's tensor cores can match MPS speeds if you tile the work carefully and orchestrate the simdgroup/threadgroup layout, but you must ensure matrix dimensions align with the tile sizes or pay for padding and branching. Second, simdgroup_async_copy has counterintuitive performance: on M2 hardware, a single thread issuing the async copy can be markedly faster than many threads cooperating, because the copy emits expensive integer indexing instructions. That microbenchmark insight and the undocumented API mean significant raw performance is accessible on consumer Apple GPUs, but with portability and maintenance risks: reproducing these gains takes real Metal expertise and careful benchmarking.
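To make the second takeaway concrete, the issue-from-one-thread pattern looks roughly like the sketch below, slotted into the k-loop of the kernel above in place of the cooperative copy. Everything about the intrinsic here is an assumption: simdgroup_async_copy and __metal_wait_simdgroup_events are undocumented, so the real declarations (argument order and their __asm symbol names) have to be copied from the post's wrapper, and `async_copy_tile` is a hypothetical helper standing in for that wrapper:

```metal
// ASSUMED stand-ins for the undocumented API; shapes only, not real signatures.
struct _simdgroup_event_t;
void __metal_wait_simdgroup_events(int count, thread _simdgroup_event_t **events);

// Hypothetical helper mirroring the post's template wrapper: copy a
// rows x cols float tile from device memory (row stride srcStride) into
// threadgroup memory (row stride dstStride), returning an event handle.
thread _simdgroup_event_t *async_copy_tile(threadgroup float *dst, uint dstStride,
                                           const device float *src, uint srcStride,
                                           uint rows, uint cols);

// Inside the k-loop of the tiled kernel above, replacing the cooperative copy:
if (tid == 0) {
    thread _simdgroup_event_t *events[2];
    // One thread issues both copies; letting every thread participate would
    // replicate the copy's expensive integer indexing arithmetic per lane.
    events[0] = async_copy_tile(&tileA[0][0], TILE,
                                A + (tg.y * TILE) * N + k, N, TILE, TILE);
    events[1] = async_copy_tile(&tileB[0][0], TILE,
                                B + k * N + (tg.x * TILE), N, TILE, TILE);
    // Only the issuing thread waits on the hardware events...
    __metal_wait_simdgroup_events(2, events);
}
// ...and the barrier publishes the staged tiles to the whole threadgroup.
threadgroup_barrier(mem_flags::mem_threadgroup);
```

The post only establishes the single-thread win for M2-class hardware, so whether this variant beats a cooperative copy should be re-benchmarked on each chip generation.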