The bug that taught me more about PyTorch than years of using it (elanapearl.github.io)

🤖 AI Summary
A training loss plateau that looked like a hyperparameter or model bug turned out to be a silent PyTorch GPU kernel bug that froze the encoder weights. On Apple Silicon (the MPS backend), a niche kernel silently failed when writing to non-contiguous output tensors: in particular, addcmul_ and addcdiv_ did not update non-contiguous outputs. That broke Adam's internal state for those parameters (exp_avg updated but exp_avg_sq stayed zero), so the adaptive updates became effectively no-ops and the encoder weights never changed. A manual SGD update (which applies gradients directly) worked, temporarily switching to float64 avoided the failure, and a minimal reproduction plus walkthrough is available on GitHub. The author submitted a PR; these MPS issues affected PyTorch <2.4, and some random ops still showed problems on macOS <15 as of PyTorch 2.10.

Why this matters: the bug shows how silent backend/kernel failures can masquerade as model or hyperparameter mistakes and waste days of debugging.

Key takeaways for ML engineers: inspect optimizer state tensors (exp_avg, exp_avg_sq) and tensor contiguity when debugging stalled updates; test with simple manual updates to separate optimizer problems from backward-pass problems; and be cautious running on MPS with older PyTorch/macOS combinations. More broadly, the post is a practical, stepwise tour through optimizer internals, memory layout, and the dispatch and kernel layers, and it serves as a useful debugging template for framework-level failures.
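
The summary only names the failing kernels, so here is a minimal sketch (not the author's published reproduction) of the kind of check it describes: apply addcmul_ to a non-contiguous view and compare against the same update on a freshly allocated contiguous tensor. Shapes and values are illustrative; on an affected MPS build (PyTorch <2.4) the non-contiguous output would be left unchanged.

    import torch

    # Illustrative check, not the author's exact repro.
    device = "mps" if torch.backends.mps.is_available() else "cpu"

    out = torch.zeros(4, 4, device=device).t()   # transposed view -> non-contiguous
    assert not out.is_contiguous()

    a = torch.ones(4, 4, device=device)
    b = torch.full((4, 4), 2.0, device=device)

    out.addcmul_(a, b, value=1.0)                # out += 1.0 * a * b

    # Same update into a contiguous tensor as a reference.
    expected = torch.zeros(4, 4, device=device).addcmul_(a, b, value=1.0)
    print(torch.allclose(out, expected))         # False on an affected build, True otherwise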
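
The debugging takeaways can also be sketched in code. The model, data, and learning rate below are placeholders rather than the author's setup; the point is to inspect Adam's per-parameter state and contiguity, then bypass the optimizer with a plain gradient step to isolate where updates stop.

    import torch
    import torch.nn as nn

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = nn.Linear(8, 1).to(device)           # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    x, y = torch.randn(32, 8, device=device), torch.randn(32, 1, device=device)
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    optimizer.step()

    # 1. Inspect Adam's state: exp_avg moving while exp_avg_sq stays at zero
    #    (especially for a non-contiguous parameter) matches the failure described above.
    for p in model.parameters():
        state = optimizer.state[p]
        print(p.is_contiguous(),
              state["exp_avg"].abs().max().item(),
              state["exp_avg_sq"].abs().max().item())

    # 2. Bypass the optimizer with a plain SGD-style update: if weights move here
    #    but not under Adam, the backward pass is fine and the optimizer path
    #    (or a kernel it calls) is the suspect.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(p.grad, alpha=-1e-3)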