🤖 AI Summary
A new open-source project automates the tedious but performance-critical step of writing Triton backward (gradient) kernels by combining a small curated dataset with LLM-assisted code generation. The repo ships a dataset of 500 forward–backward Triton stub pairs (collected from permissively licensed GitHub repos), precomputed embeddings, and a retrieval-augmented-generation (RAG) workflow: embed an arbitrary forward stub, retrieve the N most similar forward kernels, paste their backward implementations into the LLM context, and prompt the model to produce an efficient backward kernel. The tool also includes gradient-correctness checks, benchmarks, snapshots/rollbacks, process isolation, and orchestration utilities so you can validate and iterate on generated kernels.
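A minimal sketch of what the retrieval step could look like, assuming the precomputed embeddings are a NumPy matrix with one row per stored forward kernel and cosine similarity is used for ranking; the function names, the `embed` callable, and the dataset layout are illustrative assumptions, not the project's actual API:

```python
# Illustrative RAG prompt assembly, not the project's code.
import numpy as np

def cosine_top_n(query_vec: np.ndarray, corpus: np.ndarray, n: int) -> np.ndarray:
    """Return indices of the n corpus rows most similar to query_vec."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    return np.argsort(-(corpus_norm @ query_norm))[:n]

def build_prompt(forward_stub: str, embed, dataset, embeddings: np.ndarray, n: int = 3) -> str:
    """Embed the forward stub, retrieve the n most similar stored forward kernels,
    and paste their backward implementations into the LLM context."""
    idx = cosine_top_n(embed(forward_stub), embeddings, n)  # embed: any embedding model (assumed)
    examples = "\n\n".join(
        f"# Forward:\n{dataset[i]['forward']}\n# Backward:\n{dataset[i]['backward']}"
        for i in idx
    )
    return (
        "Here are forward/backward Triton kernel pairs similar to the target:\n\n"
        f"{examples}\n\n"
        "Write an efficient backward kernel for this forward kernel:\n\n"
        f"{forward_stub}\n"
    )
```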
This is technically significant because the hard part of fast backward kernels is not symbolic differentiation of the math but designing a parallelization schedule that avoids heavy synchronization (e.g., atomics). The project explicitly targets that problem: its examples show transforming a forward schedule that tiles over Q (which forces atomics when naively differentiated) into a backward schedule that tiles over K/V, so each program accumulates its gradient contributions locally and issues regular stores instead of atomics. The approach accepts that LLMs struggle to write such kernels from scratch but do much better when given concrete, similar examples in context. Current limitations: single-file execution, limited support for heavily quantized kernels or for gradients computed inside the forward pass, and small test shapes; the original code is Apache-2.0, with third-party licenses noted.
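To make the scheduling point concrete, here is a toy, hypothetical Triton example (not taken from the project) for the simpler op out[q, :] = Σ_k A[q, k] · V[k, :]. Its gradient dV[k, :] = Σ_q A[q, k] · dOut[q, :] would need atomic adds if the backward reused the forward's tiling over Q; tiling the backward over K instead lets each program own a block of dV rows, accumulate locally while looping over Q, and finish with ordinary stores:

```python
# Toy illustration of re-tiling a backward kernel over K to avoid atomics.
# Not the project's code; shapes and block sizes are illustrative.
import torch
import triton
import triton.language as tl

@triton.jit
def dv_kernel(A_ptr, dOut_ptr, dV_ptr,
              Q, K, D,
              BLOCK_K: tl.constexpr, BLOCK_Q: tl.constexpr, BLOCK_D: tl.constexpr):
    # Each program owns BLOCK_K rows of dV and writes nowhere else, so no atomics.
    pid = tl.program_id(0)
    k_offs = pid * BLOCK_K + tl.arange(0, BLOCK_K)
    d_offs = tl.arange(0, BLOCK_D)                 # assumes D <= BLOCK_D
    acc = tl.zeros((BLOCK_K, BLOCK_D), dtype=tl.float32)
    for q_start in range(0, Q, BLOCK_Q):           # loop over Q instead of parallelizing it
        q_offs = q_start + tl.arange(0, BLOCK_Q)
        # A is (Q, K) row-major; load a (BLOCK_Q, BLOCK_K) tile.
        a = tl.load(A_ptr + q_offs[:, None] * K + k_offs[None, :],
                    mask=(q_offs[:, None] < Q) & (k_offs[None, :] < K), other=0.0)
        # dOut is (Q, D) row-major; load a (BLOCK_Q, BLOCK_D) tile.
        do = tl.load(dOut_ptr + q_offs[:, None] * D + d_offs[None, :],
                     mask=(q_offs[:, None] < Q) & (d_offs[None, :] < D), other=0.0)
        acc += tl.dot(tl.trans(a), do)             # dV[k, :] += A[q, k] * dOut[q, :]
    # One regular store per program; the naive Q-tiled backward would need tl.atomic_add here.
    tl.store(dV_ptr + k_offs[:, None] * D + d_offs[None, :], acc,
             mask=(k_offs[:, None] < K) & (d_offs[None, :] < D))

# Usage sketch with a quick numerical check against the reference gradient.
Q, K, D = 256, 256, 64
A = torch.randn(Q, K, device="cuda")
dOut = torch.randn(Q, D, device="cuda")
dV = torch.empty(K, D, device="cuda")
dv_kernel[(triton.cdiv(K, 64),)](A, dOut, dV, Q, K, D,
                                 BLOCK_K=64, BLOCK_Q=64, BLOCK_D=64)
torch.testing.assert_close(dV, A.t() @ dOut, rtol=1e-2, atol=1e-2)
```

The design choice mirrors the summary above: the backward parallelizes over the axis whose gradient it owns (K here, K/V in the attention example) and serializes the other axis inside the kernel, trading a loop for the elimination of cross-program write contention.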