🤖 AI Summary
Naomi Saphra, an interpretability researcher moving from Harvard's Kempner Institute to Boston University, argues that to truly understand large language models we must study how they evolve during training, not just inspect their final weights. Drawing an analogy to evolutionary biology, she casts stochastic gradient descent as the "evolutionary" process that shapes a model's internal mechanisms. Early training dynamics and random initial conditions, she says, can lock models into particular solutions or leave "vestigial" features that look important in hindsight but aren't causally necessary for generalization. That perspective reframes interpretability: instead of only probing end-state neurons and intervening on finished models, researchers should track how structures and behaviors co-emerge across multiple runs and checkpoints.
Technically, this approach uses variation across random initializations and longitudinal access to intermediate checkpoints to correlate internal structure with later generalization, supporting stronger causal claims about why models behave as they do. Examples include finding that certain internal mechanisms appear just before sudden gains on grammatical tasks in masked language models, and that encouraging or suppressing particular neurons during training can improve or degrade performance. The main barriers are limited access to proprietary training traces and the scarcity of multi-run studies. Embracing training-dynamics studies could improve predictive models of behavior, strengthen robustness, and inform the design of safer, more interpretable architectures.
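To make the methodology concrete, here is a minimal sketch of a cross-seed checkpoint analysis. It is illustrative only and not drawn from Saphra's work: the probe and evaluation functions are hypothetical placeholders that return random values, standing in for a real probing metric computed on saved checkpoints and a real held-out evaluation of each trained run.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder stand-ins for framework-specific I/O. In practice these would
# load saved intermediate checkpoints and evaluate each fully trained run.
def structure_score(seed: int, step: int) -> float:
    """Probe an internal mechanism (e.g. a candidate syntax head) at one checkpoint."""
    return float(rng.random())  # placeholder: replace with a real probe metric

def final_generalization(seed: int) -> float:
    """Held-out generalization score of the fully trained run for this seed."""
    return float(rng.random())  # placeholder: replace with a real evaluation

SEEDS = range(10)      # multiple random initializations of the same model
PROBE_STEP = 10_000    # an early-training checkpoint to inspect

early_structure = [structure_score(s, PROBE_STEP) for s in SEEDS]
late_generalization = [final_generalization(s) for s in SEEDS]

# Across runs: does early internal structure predict eventual generalization?
rho, p = spearmanr(early_structure, late_generalization)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```

The key design point is that the unit of analysis is a training run, not a single model: only by comparing many seeds and many checkpoints can one ask whether an internal structure's early emergence predicts, rather than merely accompanies, later generalization.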