🤖 AI Summary
An engineer spent weeks reverse‑engineering GEMM behavior on Hopper and Blackwell (using CuteDSL), mapped the decade‑long evolution from Volta → Ampere → Hopper → Blackwell, and used those observations, together with Jensen Huang's Rubin reveal, to predict Vera Rubin's microarchitecture. The writeup argues that Nvidia's true moat is full‑stack engineering, the careful removal of "dirty work" across hardware, compilers, runtimes, and system integration, rather than any single feature. Key technical takeaways: TensorCores have steadily grown in per‑instruction scale and in supported precisions (Blackwell adds the MXFP block‑scaled formats on top of Hopper's FP8, while Blackwell Ultra trades away some high‑precision FP throughput). The author predicts Rubin will roughly double TensorCore tiles (to ~256×N×256‑bit) and move from Blackwell's 2‑CTA MMA to a 4‑CTA collaborative MMA, increasing scheduler pressure on the CGA. The data path evolved from register‑based FMAs → cp.async bypassing the register file → TMA copying straight into SMEM → Blackwell's TMEM decoupling TensorCores from CUDA cores; asynchronous primitives (cp.async, MBarrier, Async Proxy) now enable finer‑grained software pipelines, as the sketch below illustrates.
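To make the Ampere‑era step of that data‑path evolution concrete, here is a minimal CUDA sketch (not from the article; the tile size, stage count, and the trivial "scale by alpha" compute step are illustrative assumptions) of a cp.async‑style copy/compute pipeline built on libcu++'s cuda::memcpy_async and cuda::pipeline, which lower to cp.async on sm_80+:

```cuda
// Minimal sketch, not from the article: an Ampere-style asynchronous data path.
// cuda::memcpy_async lowers to cp.async on sm_80+, moving tiles from global
// memory into shared memory without staging through registers, while a
// two-stage cuda::pipeline overlaps the copy of tile i+1 with compute on tile i.
// Tile size, stage count, and the "scale by alpha" step are illustrative.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

constexpr int kTile   = 256;  // floats copied per pipeline stage
constexpr int kStages = 2;    // double buffering in shared memory

__global__ void scale_async(const float* __restrict__ in,
                            float* __restrict__ out,
                            int tiles_per_block, float alpha) {
    __shared__ float smem[kStages][kTile];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, kStages> state;

    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);  // all threads produce and consume

    const size_t base = (size_t)blockIdx.x * tiles_per_block * kTile;
    int fetch = 0;

    for (int compute = 0; compute < tiles_per_block; ++compute) {
        // Keep up to kStages copies in flight ahead of the compute stage.
        for (; fetch < tiles_per_block && fetch < compute + kStages; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[fetch % kStages],
                               in + base + (size_t)fetch * kTile,
                               sizeof(float) * kTile, pipe);  // issues cp.async
            pipe.producer_commit();
        }

        pipe.consumer_wait();                  // block until this tile landed in SMEM
        const float* tile = smem[compute % kStages];
        for (int i = (int)threadIdx.x; i < kTile; i += blockDim.x)
            out[base + (size_t)compute * kTile + i] = alpha * tile[i];
        pipe.consumer_release();               // recycle the shared-memory stage
    }
}
```

Hopper's TMA and Blackwell's TMEM push the same idea further, replacing per‑thread copies with descriptor‑driven bulk transfers and dedicated on‑chip storage for TensorCore operands and accumulators.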
Significant practical implications: Blackwell's die tradeoffs (more TMEM/DSMEM, fewer SMs) left the SFU under‑provisioned, creating a Softmax/Attention bottleneck that caps throughput despite the huge GEMM gains; this pushes some workloads toward sparse attention, though the author argues Softmax remains fundamental from an optimal‑transport view (see the sketch below). Blackwell's mix of synchronous and asynchronous tcgen05 instructions, its per‑thread vs. warp vs. multi‑SM granularities, and its TMEM alloc/dealloc semantics all raise programming and scheduling complexity, demanding richer compiler and runtime support. Finally, lower kernel latencies expose microsecond‑scale latency issues on the Grace CPU and NVLink, so future gains will depend as much on system and software orchestration as on raw ALU/TensorCore scaling.
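To make the SFU bottleneck concrete, here is a minimal CUDA sketch (not from the article; the row length, launch shape, and use of __expf are illustrative assumptions) of a warp‑per‑row softmax. The surrounding attention GEMMs run on TensorCores, but every score still needs one exponential issued to the special function units, so if SFU throughput per SM does not grow with TensorCore FLOPs, this loop becomes the limiter.

```cuda
// Minimal sketch, not from the article: a numerically stable warp-per-row
// softmax. Step 2 issues one SFU transcendental (__expf -> MUFU on the SFU)
// per score, which is the part that becomes the bottleneck when SFU
// throughput lags TensorCore GEMM throughput.
#include <cuda_runtime.h>
#include <math_constants.h>

__global__ void row_softmax(const float* __restrict__ scores,
                            float* __restrict__ probs, int cols) {
    const int row  = blockIdx.x;            // one 32-thread warp per row
    const int lane = threadIdx.x;           // launch with blockDim.x == 32
    const float* s = scores + (size_t)row * cols;
    float*       p = probs  + (size_t)row * cols;

    // 1) Row max (for numerical stability), reduced with warp shuffles.
    float m = -CUDART_INF_F;
    for (int c = lane; c < cols; c += 32) m = fmaxf(m, s[c]);
    for (int off = 16; off > 0; off >>= 1)
        m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, off));

    // 2) exp + sum: one SFU transcendental per element, the SFU-bound step.
    float sum = 0.f;
    for (int c = lane; c < cols; c += 32) {
        float e = __expf(s[c] - m);
        p[c] = e;
        sum += e;
    }
    for (int off = 16; off > 0; off >>= 1)
        sum += __shfl_xor_sync(0xffffffffu, sum, off);

    // 3) Normalize.
    const float inv = 1.f / sum;
    for (int c = lane; c < cols; c += 32) p[c] *= inv;
}
```

A hypothetical launch such as `row_softmax<<<rows, 32>>>(scores, probs, cols)` makes the cost explicit: the number of SFU exponentials scales with the full score matrix, independent of how fast the TensorCores produced it.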