Show HN: I wrote a (slightly less slow, but still bad) autodiff from scratch (github.com)

🤖 AI Summary
A developer released two small reverse-mode autodiff libraries written in C++/CUDA (silly-autodiff and daft-autodiff) as a learning project for C++, CUDA, and deep learning. Both implement a minimal tensor/operation DAG (1D/2D tensors only) with operations such as matrix × vector, matrix × matrix, inner products, scalar multiply, leaky ReLU, max-pool, flatten/concat, and single-channel convolution (implemented by "unrolling" the kernel into a larger matrix so convolution becomes a matrix multiply).

The libraries use cuBLAS where possible (so you'll see many CUBLAS_OP_T transposes), require an nvcc toolchain and a CUDA GPU, include examples (a feed-forward NN and LeNet) and a test suite, and offer a batchCompute helper that preloads many inputs to reduce host/device copies (not a conventional minibatch API). This is explicitly pedagogical and quite slow: the author reports daft-lenet taking ~1,400 s per epoch on a GTX 1060 versus ~20 s for a simple PyTorch implementation.

The repo highlights practical AD/CUDA lessons: reverse-accumulation seed propagation, memory-ownership issues in a DAG (double-delete pitfalls), the tradeoffs of im2col-style convolution, and build/compatibility pain (glibc problems; Release builds use -O3, --use_fast_math, and -march=native but may break tests). Planned work includes performance tuning (goal: within 10× of PyTorch), AlexNet, BPTT, weight I/O/ONNX, and a REPL. It's a useful reference for anyone learning how to wire up GPU autodiff from scratch and the real engineering tradeoffs involved.
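Neither repo's API appears in the summary, so the following is only a minimal CPU-side sketch of the idea being described: a DAG of operation nodes owned through shared_ptr (one common way to sidestep the double-delete pitfalls mentioned above, not necessarily what the repos do), with reverse accumulation seeded by setting the output's gradient to 1. All names (Node, add, mul, backward) are illustrative, not identifiers from silly-autodiff or daft-autodiff.

```cpp
// Minimal CPU-only sketch of reverse-mode autodiff over a DAG of scalar nodes.
// Illustrative only: this is not silly-/daft-autodiff's API.
#include <cstdio>
#include <functional>
#include <memory>
#include <vector>

struct Node {
    double value = 0.0;
    double grad  = 0.0;                              // accumulated adjoint
    std::vector<std::shared_ptr<Node>> parents;      // shared ownership avoids double deletes
    std::function<void(Node&)> backward_fn;          // pushes this node's grad into its parents
};

using NodePtr = std::shared_ptr<Node>;

NodePtr leaf(double v) {
    auto n = std::make_shared<Node>();
    n->value = v;
    return n;
}

NodePtr add(NodePtr a, NodePtr b) {
    auto n = std::make_shared<Node>();
    n->value = a->value + b->value;
    n->parents = {a, b};
    n->backward_fn = [a, b](Node& self) {
        a->grad += self.grad;                        // d(a+b)/da = 1
        b->grad += self.grad;                        // d(a+b)/db = 1
    };
    return n;
}

NodePtr mul(NodePtr a, NodePtr b) {
    auto n = std::make_shared<Node>();
    n->value = a->value * b->value;
    n->parents = {a, b};
    n->backward_fn = [a, b](Node& self) {
        a->grad += b->value * self.grad;             // d(ab)/da = b
        b->grad += a->value * self.grad;             // d(ab)/db = a
    };
    return n;
}

// Reverse accumulation: seed the output's grad with 1, then walk the DAG.
// (Depth-first here for brevity; a real implementation would process nodes
// in reverse topological order and visit each one exactly once.)
void backward(const NodePtr& out) {
    out->grad = 1.0;                                 // the seed
    std::function<void(const NodePtr&)> visit = [&](const NodePtr& n) {
        if (n->backward_fn) n->backward_fn(*n);
        for (auto& p : n->parents) visit(p);
    };
    visit(out);
}

int main() {
    auto x = leaf(3.0), y = leaf(4.0);
    auto z = add(mul(x, y), y);                      // z = x*y + y
    backward(z);
    std::printf("dz/dx = %g, dz/dy = %g\n", x->grad, y->grad);  // 4 and 4
}
```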
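The "unroll the kernel into a larger matrix" trick can also be sketched on the CPU; the stride-1, no-padding ("valid") assumptions and all function names below are mine, not taken from the repos. The tradeoff the summary alludes to is visible in the shapes: a small kh×kw kernel becomes an (out_h·out_w) × (in_h·in_w) matrix, trading memory for the ability to reuse a plain GEMV/GEMM (which cuBLAS provides).

```cpp
// CPU sketch: turn a single-channel 2D "valid" convolution (really cross-correlation,
// as in most DL code) into a matrix-vector product by unrolling the kernel into a
// larger (out_h*out_w) x (in_h*in_w) matrix. Stride 1, no padding; names are illustrative.
#include <cstdio>
#include <vector>

// Build the unrolled kernel matrix K so that  output = K * vec(image).
std::vector<double> unroll_kernel(const std::vector<double>& kernel, int kh, int kw,
                                  int in_h, int in_w) {
    const int out_h = in_h - kh + 1, out_w = in_w - kw + 1;
    std::vector<double> K(static_cast<size_t>(out_h) * out_w * in_h * in_w, 0.0);
    for (int i = 0; i < out_h; ++i)
        for (int j = 0; j < out_w; ++j) {
            const int row = i * out_w + j;
            for (int ki = 0; ki < kh; ++ki)
                for (int kj = 0; kj < kw; ++kj) {
                    const int col = (i + ki) * in_w + (j + kj);
                    K[static_cast<size_t>(row) * in_h * in_w + col] = kernel[ki * kw + kj];
                }
        }
    return K;
}

int main() {
    const int in_h = 4, in_w = 4, kh = 3, kw = 3;
    std::vector<double> image(in_h * in_w, 1.0);     // all-ones 4x4 image
    std::vector<double> kernel(kh * kw, 1.0);        // all-ones 3x3 kernel
    auto K = unroll_kernel(kernel, kh, kw, in_h, in_w);

    // output = K * vec(image): a plain matrix-vector product, which is what a
    // cuBLAS GEMV would compute on the GPU.
    const int out_h = in_h - kh + 1, out_w = in_w - kw + 1;
    std::vector<double> out(out_h * out_w, 0.0);
    for (int r = 0; r < out_h * out_w; ++r)
        for (int c = 0; c < in_h * in_w; ++c)
            out[r] += K[static_cast<size_t>(r) * in_h * in_w + c] * image[c];

    for (double v : out) std::printf("%g ", v);      // each of the 4 outputs is 9
    std::printf("\n");
}
```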
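As for the CUBLAS_OP_T transposes: for a dense layer y = W·x, the input gradient is dL/dx = Wᵀ·dL/dy, so a backward pass can reuse the same weight buffer with the transpose flag set instead of materializing Wᵀ. The snippet below is a generic illustration of that pattern with cublasSgemv; it is not code from the repos, and error checking is omitted for brevity.

```cpp
// CUDA/cuBLAS sketch of why CUBLAS_OP_T shows up in a backward pass:
//   forward:   y = W x            -> cublasSgemv with CUBLAS_OP_N
//   backward:  dL/dx = W^T dL/dy  -> the same W buffer with CUBLAS_OP_T
// Generic illustration only (not the repos' code); error checking omitted.
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 2, n = 3;                       // W is m x n, column-major for cuBLAS
    std::vector<float> W  = {1, 2,  3, 4,  5, 6}; // columns: (1,2), (3,4), (5,6)
    std::vector<float> x  = {1, 1, 1};            // input, length n
    std::vector<float> dy = {1, 1};               // upstream gradient dL/dy, length m

    float *d_W, *d_x, *d_y, *d_dy, *d_dx;
    cudaMalloc((void**)&d_W,  W.size() * sizeof(float));
    cudaMalloc((void**)&d_x,  n * sizeof(float));
    cudaMalloc((void**)&d_y,  m * sizeof(float));
    cudaMalloc((void**)&d_dy, m * sizeof(float));
    cudaMalloc((void**)&d_dx, n * sizeof(float));
    cudaMemcpy(d_W,  W.data(),  W.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x,  x.data(),  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_dy, dy.data(), m * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float one = 1.0f, zero = 0.0f;

    // Forward: y = W x (no transpose).
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &one, d_W, m, d_x, 1, &zero, d_y, 1);
    // Backward w.r.t. the input: dL/dx = W^T dL/dy (same weights, transposed).
    cublasSgemv(handle, CUBLAS_OP_T, m, n, &one, d_W, m, d_dy, 1, &zero, d_dx, 1);

    std::vector<float> y(m), dx(n);
    cudaMemcpy(y.data(),  d_y,  m * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(dx.data(), d_dx, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("y  = %g %g\n", y[0], y[1]);              // expected: 9 12
    std::printf("dx = %g %g %g\n", dx[0], dx[1], dx[2]);  // expected: 3 7 11

    cublasDestroy(handle);
    cudaFree(d_W); cudaFree(d_x); cudaFree(d_y); cudaFree(d_dy); cudaFree(d_dx);
}
```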