🤖 AI Summary
An author published a hands-on CUDA "hitchhiker's guide" that walks readers from the basics to an optimized SGEMM kernel reaching about 95% of the performance of cuBLAS (NVIDIA's closed-source BLAS library). The post frames the work around why GEMM matters for ML (training and inference cost) and shows that careful systems engineering can approach vendor-library speeds, which is useful for custom inference kernels or research that needs bespoke memory/compute trade-offs.
The tutorial covers GPU hardware and execution fundamentals (the HBM → L2 → L1/shared → registers bandwidth ordering; L1 and shared memory are the same physical resource; tensor cores don't do fp32), the SIMT model, the thread/block/warp hierarchy, and occupancy, which is limited by shared memory per block, threads per SM, and registers per thread (Hopper supports up to 64 resident warps per SM). It demonstrates latency hiding via many warps and profiling with Nsight Compute. The matmul optimizations are presented incrementally: naive → shared-memory caching → thread tiling → vectorized/coalesced access → prefetching/pipelining → swizzling → warp tiling → k-split, showing how tiling, memory coalescing, register/shared-memory tradeoffs, and pipeline prefetching together yield near-cuBLAS performance. The write-up is both a practical tutorial and a reminder that a deep understanding of hardware and occupancy can unlock major ML performance gains.
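To make the progression concrete, here is a minimal sketch of the first optimization step in that list (shared-memory caching), not the author's code: it assumes row-major float matrices, a hypothetical 32×32 tile size, and a kernel name `sgemm_smem` chosen for illustration. Each block stages a tile of A and B in shared memory so the inner-product reads hit on-chip storage instead of HBM, and the global loads are arranged so consecutive threads touch consecutive addresses (coalescing).

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile size; the tutorial tunes this per step

// C[M x N] = A[M x K] * B[K x N], all row-major. Launch with a TILE x TILE block.
__global__ void sgemm_smem(int M, int N, int K,
                           const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output row handled by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // output column handled by this thread
    float acc = 0.0f;

    // Walk the K dimension one tile at a time.
    for (int t = 0; t < K; t += TILE) {
        // Cooperative, coalesced loads into shared memory (edges padded with zeros).
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Inner product over the staged tile; every operand now comes from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}

// Example launch:
//   dim3 block(TILE, TILE);
//   dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
//   sgemm_smem<<<grid, block>>>(M, N, K, dA, dB, dC);
```

The later steps in the post (thread/warp tiling, vectorized loads, prefetching, swizzling, k-split) build on this same structure by giving each thread more output elements in registers and overlapping memory traffic with compute.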
        