Towards Automated GPU Kernel Generation (simonguo.tech)

🤖 AI Summary
Simon Guo and Alex Zhang publish a one-year retrospective on KernelBench, their benchmark and evaluation framework for automated GPU kernel generation, born out of the GPU MODE hackathon. KernelBench frames the task as: given a PyTorch operator, transpile it into inline CUDA plus a PyTorch wrapper, then verify numerical correctness and measure performance. The benchmark spans three difficulty levels and reports a fast_p metric: the percentage of problems where a model produces a correct kernel that is at least p× faster than the PyTorch eager/compile baseline.

The authors report that frontier LLMs produced correct, faster kernels less than 20% of the time and often fail to exploit hardware-specific intrinsics (e.g., Tensor Cores). They call out a critical data bottleneck: CUDA code is extremely sparse in public corpora (~0.073% of The Stack), which limits model priors. KernelBench also surfaced practical evaluation lessons (profiling variance, reward hacking) and effective methods to scale search. "Monkey"-style parallel sampling, generating many candidate kernels and filtering with a verifier, yielded big gains: sampling DeepSeek‑V3 100 times improved fast_1 on Level 2 from 4% to 37%. Iterative refinement with feedback boosted results further: DeepSeek‑R1 moved fast_1 on Level 2 from 36% to 72%.

Community follow-ups (BackendBench, TritonBench, FlashInferBench) emphasize stronger correctness checks and end-to-end integration. Looking ahead, the authors highlight compiler co-design, new DSLs, hardware-aware agents, and richer GPU code corpora as the path to reliable, production-grade automated kernel generation.
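The fast_p metric described above can be sketched as a small Python function. This is an illustrative reconstruction from the summary, not KernelBench's actual implementation; the `KernelResult` fields and function signature are assumptions.

```python
# Hypothetical sketch of the fast_p metric: the fraction of benchmark
# problems where the generated kernel is numerically correct AND at
# least p-times faster than the PyTorch baseline. Names are illustrative.
from dataclasses import dataclass

@dataclass
class KernelResult:
    correct: bool        # passed numerical-correctness checks
    baseline_ms: float   # PyTorch eager/compile runtime
    kernel_ms: float     # generated kernel runtime

def fast_p(results: list[KernelResult], p: float = 1.0) -> float:
    """Fraction of problems with a correct kernel >= p-times faster."""
    if not results:
        return 0.0
    wins = sum(
        1 for r in results
        if r.correct and r.baseline_ms >= p * r.kernel_ms
    )
    return wins / len(results)

# Example: 2 of 4 problems yield a correct kernel at least as fast
# as the baseline (p = 1.0).
results = [
    KernelResult(True, 2.0, 1.5),   # correct, 1.33x faster -> counts
    KernelResult(True, 2.0, 2.5),   # correct but slower -> excluded
    KernelResult(False, 2.0, 0.5),  # fast but incorrect -> excluded
    KernelResult(True, 3.0, 1.0),   # correct, 3x faster -> counts
]
print(fast_p(results, p=1.0))  # 0.5
```

Raising p tightens the bar: fast_1 asks only for parity with the baseline, while fast_2 requires a 2× speedup, which is why the reported fast_1 numbers are the headline figures.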