🤖 AI Summary
While converting a Gluon attention kernel to a persistent form (PR #7298), engineers observed a startling performance quirk: FP8 runs were roughly 100 TFLOPS faster when the GPU kernel name contained the substring "cutlass." The behavior showed up in before/after perf comparisons during the persistent-kernel transition, where overall performance also dropped for some configs (notably D64). Investigators traced the anomaly to the NVIDIA PTX assembler (ptxas): its instruction scheduling, and with it the quality of the generated code, appears to depend on the kernel name, and no workaround for the underlying scheduling regression has been found yet.
This is significant because FP8 and kernel-level optimizations are central to squeezing maximum throughput from modern AI workloads; a compiler/tooling heuristic that keys on a symbol name undermines reproducibility, benchmarking, and tuning. Key technical points: the difference is on the order of ~100 TFLOPS for FP8 kernels, triggered simply by including "cutlass" in the kernel identifier, and the persistent-kernel rewrite exposed a ptxas scheduling bug that reduced performance for certain configurations (notably D64). Practitioners should treat performance numbers cautiously, consider temporarily renaming kernels or avoiding the problematic transformation until NVIDIA fixes ptxas, and inspect the compiler-generated assembly to pinpoint the scheduling regression.
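To make the renaming advice concrete, here is a minimal sketch (not from the PR; all function names are hypothetical) of how one might A/B the name effect in Triton: the same kernel body is compiled under two function names, one containing "cutlass", and both are timed. A toy add kernel will not reproduce the FP8 gap, which involves MMA-heavy FP8 code, but the methodology carries over to real kernels.

```python
# Hypothetical A/B harness: identical kernel bodies, one symbol containing
# "cutlass", timed back to back. Triton derives the GPU symbol from the
# decorated Python function's name, so a rename here changes what ptxas sees.
import torch
import triton
import triton.language as tl


@triton.jit
def plain_add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


@triton.jit
def cutlass_add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Identical body; only the symbol name differs.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


def bench(kernel, n=1 << 24, BLOCK=1024):
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, BLOCK),)
    # Median runtime in milliseconds over repeated launches.
    return triton.testing.do_bench(lambda: kernel[grid](x, y, out, n, BLOCK=BLOCK))


if __name__ == "__main__":
    print(f"plain:   {bench(plain_add_kernel):.3f} ms")
    print(f"cutlass: {bench(cutlass_add_kernel):.3f} ms")
```

Dumping the SASS for both variants (e.g., via cuobjdump/nvdisasm on the compiled cubins) and diffing the instruction schedules is the natural next step when the timings diverge.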