Deep dive into the small details of micrograd (omrishneor.github.io)

🤖 AI Summary
This post is a hands‑on breakdown of Andrej Karpathy’s micrograd lecture, unpacking the tiny implementation choices that make a minimal autodiff engine work. The author reconstructs the computational graph idea: Value objects hold data and grad (where v.grad = ∂L/∂v), nodes are composed with operator overloads, and each operation attaches a custom _backward closure that encodes the local derivative (e.g., for multiplication x*y the closure does x.grad += y.data * out.grad and y.grad += x.data * out.grad). That design explains where out.grad (∂L/∂o) comes from and why each operator needs its own backward logic: the closures capture the inputs and compute the local partials used by the chain rule, ∂L/∂x = (∂L/∂o)(∂o/∂x).

The post also walks through the Value.backward implementation: build a topological ordering of the DAG, set the final node’s grad to 1.0 (∂L/∂L = 1), and iterate nodes in reverse topological order, calling each node’s _backward to accumulate gradients (using += to handle branching).

Finally, it connects backprop to optimization: after zeroing grads for trainable parameters, run L.backward() and update parameters with v.data -= η * v.grad. The writeup clarifies key autodiff mechanics and intuitions that are essential for anyone wanting to understand or implement neural network frameworks from scratch.
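To make those mechanics concrete, here is a minimal Python sketch in the spirit of the Value class the post describes. The class name, the _backward closures, and the reverse-topological backward pass follow the summary above; the exact method bodies (only __add__ and __mul__ are shown) and the toy training step at the end are a simplification for illustration, not micrograd's original source.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                  # accumulates dL/d(this node)
        self._backward = lambda: None    # local backward closure, set by each op
        self._prev = set(_children)      # parent nodes in the computational graph

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # chain rule for o = x*y: dL/dx += y * dL/do, dL/dy += x * dL/do
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # local derivative of addition is 1 for both inputs
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        # build a topological ordering of the DAG ending at this node
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)

        self.grad = 1.0                  # dL/dL = 1
        for node in reversed(topo):      # reverse topo order: outputs before inputs
            node._backward()


# Toy usage: L = a*b + c, then one gradient-descent step (values are arbitrary).
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
L = a * b + c
L.backward()
print(a.grad, b.grad, c.grad)            # -3.0, 2.0, 1.0

lr = 0.01
for p in (a, b, c):
    p.data -= lr * p.grad                # v.data -= eta * v.grad
    p.grad = 0.0                         # zero grads before the next backward pass
```

Note how += in each closure lets a node that feeds multiple downstream operations accumulate gradient contributions from every path, which is the branching case the post calls out.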