Backpropagation is a leaky abstraction (2016) (karpathy.medium.com)

🤖 AI Summary
Andrej Karpathy argues that you should understand backpropagation even though modern frameworks automate it, because it is a "leaky abstraction": the backward pass has non-obvious failure modes that affect architecture choices, initialization and debugging. The post defends CS231n's requirement that students implement forward and backward passes in raw numpy.

Concrete examples: sigmoids (and tanh) can saturate. The sigmoid's local gradient z*(1−z) peaks at 0.25 (at z=0.5) and approaches 0 as z nears 0 or 1, causing vanishing gradients unless weights are initialized carefully and data is preprocessed. ReLUs pass zero gradient whenever a unit's output is clamped to zero, so a unit driven inactive by bad initialization or one aggressive update can become permanently "dead". And vanilla RNN backprop multiplies the gradient by the recurrence matrix at every timestep, so that matrix's largest eigenvalue determines whether gradients vanish or explode (motivating gradient clipping or LSTMs).

A real-world bug illustrates the point: a DQN TensorFlow implementation clipped the raw TD error with clip_by_value, which has zero gradient outside the clip range, silently stopping learning. The intended effect is to clip the gradient's magnitude, which is what a Huber loss or explicit gradient clipping achieves.

Takeaway: don't treat backprop as magic. Choices of nonlinearity, initialization, loss formulation and gradient handling have direct, sometimes counterintuitive effects on training stability and capacity, and writing or inspecting low-level backprop makes you better at designing, debugging and fixing these issues.
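A minimal numpy sketch (illustrative, not code from the post; the variable names are invented here) of the two local-gradient failure modes described above:

```python
import numpy as np

# Sigmoid local gradient: with z = sigmoid(x), dz/dx = z * (1 - z).
# It peaks at 0.25 (z = 0.5) and collapses toward 0 when z saturates near 0 or 1.
z = np.array([0.5, 0.9, 0.99, 0.999])
print(z * (1 - z))                      # [0.25, 0.09, 0.0099, 0.000999]

# ReLU backward pass: gradient only flows where the pre-activation was positive.
x = np.array([-2.0, -0.1, 0.3, 1.5])    # pre-activations of one unit across a batch
upstream = np.ones_like(x)              # gradient arriving from the layer above
grad_x = upstream * (x > 0)             # zero wherever the unit output was clamped to 0
print(grad_x)                           # [0. 0. 1. 1.]
# If initialization or one aggressive update keeps x negative for every input,
# grad_x is all zeros and the unit never gets a learning signal: a "dead" ReLU.
```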
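The recurrence-matrix effect can also be seen numerically. The sketch below is again only an illustration, with an arbitrarily chosen hidden size, timestep count and weight scale, and it ignores the tanh derivative; it repeatedly multiplies a gradient vector by Whh^T and checks its norm:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 5, 50                              # hidden size and timestep count (arbitrary)
Whh = 0.3 * rng.standard_normal((H, H))   # recurrence matrix; the scale sets its spectrum

# Backprop through a vanilla RNN multiplies the hidden-state gradient by Whh^T
# once per timestep (the tanh derivative is ignored here for simplicity).
grad = rng.standard_normal(H)
for t in range(T):
    grad = Whh.T @ grad

spectral_radius = np.max(np.abs(np.linalg.eigvals(Whh)))
print(f"largest |eigenvalue| of Whh: {spectral_radius:.3f}")
print(f"gradient norm after {T} steps: {np.linalg.norm(grad):.3e}")
# Spectral radius < 1: the norm decays toward 0 (vanishing gradients).
# Spectral radius > 1: the norm blows up (exploding gradients), motivating
# gradient clipping or gated architectures such as LSTMs.
```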
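Finally, a sketch of the DQN bug in plain numpy rather than TensorFlow, comparing the analytic gradient of a squared, pre-clipped TD error against the gradient of a Huber loss. The helper names (grad_clipped_squared_td, grad_huber) are hypothetical, chosen for this illustration:

```python
import numpy as np

def grad_clipped_squared_td(td_error, clip=1.0):
    # Buggy pattern: take the loss 0.5 * clip(td)^2.  Its derivative w.r.t. the
    # TD error is clip(td) inside [-clip, clip] and exactly 0 outside, so large
    # errors contribute no gradient at all and learning silently stalls.
    clipped = np.clip(td_error, -clip, clip)
    return clipped * (np.abs(td_error) <= clip)

def grad_huber(td_error, delta=1.0):
    # Intended behaviour: the Huber loss is quadratic near 0 and linear beyond
    # delta, so its gradient is td_error inside [-delta, delta] and +/-delta
    # outside -- the gradient's magnitude is clipped, but it never vanishes.
    return np.clip(td_error, -delta, delta)

td = np.array([-5.0, -0.5, 0.5, 5.0])
print(grad_clipped_squared_td(td))  # [ 0.  -0.5  0.5  0. ]  no signal for large errors
print(grad_huber(td))               # [-1.  -0.5  0.5  1. ]  clipped but still informative
```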