Weight-sparse transformers have interpretable circuits [pdf] (cdn.openai.com)

🤖 AI Summary
OpenAI researchers trained "weight-sparse" transformer models (GPT-2-style decoders with the vast majority of weights set to zero) to produce far simpler, human-readable computation graphs for a suite of hand-crafted language tasks. They then used a structured pruning procedure to isolate the minimal set of neuron/attention/value nodes and nonzero weight edges (a "circuit") that suffices to reach a target task loss. The recovered circuits are compact (often single-digit node counts per subroutine), and internal activations map cleanly to intuitive concepts like "quote detector" or "list nesting depth." As a rigorous check, mean-ablating every node except those in a recovered circuit preserves task performance, while removing circuit nodes breaks it. The sparse models' minimal circuits are roughly 16× smaller than those extracted from dense models matched on pretraining loss.

Technically, sparsity was enforced across all parameters, including embeddings, with extreme L0 sparsity in some runs (≈1/1,000 weights nonzero), mild activation sparsity, and a learned binary-mask pruning scheme. The authors show a capability-interpretability tradeoff: increasing sparsity improves interpretability but reduces raw capability, while scaling total parameter count improves the Pareto frontier. The paper also gives preliminary methods for using sparse models as bridges to explain dense ones, and releases weights and visualization code. The main limitation: sparse models must be trained from scratch and are inefficient to scale, so preserving interpretability at frontier capabilities remains an open challenge.
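The claim that sparsity is enforced across all parameters, with roughly 1 in 1,000 weights nonzero in the most extreme runs, can be pictured as a magnitude-based projection applied during training. The sketch below is one plausible way to do this in PyTorch, not the paper's actual scheme; the `keep_fraction` default and the choice to skip 1-D parameters are assumptions for illustration.

```python
import torch


@torch.no_grad()
def project_to_topk_sparsity(model: torch.nn.Module, keep_fraction: float = 1e-3) -> None:
    """Zero out all but the largest-magnitude entries of each weight matrix.

    A minimal sketch of one way to enforce extreme L0 weight sparsity
    (the paper's exact scheme may differ). keep_fraction ~ 1e-3 mirrors
    the "~1/1,000 weights nonzero" figure from the summary.
    """
    for param in model.parameters():
        if param.dim() < 2:
            # Assumption for this sketch: leave biases / norm gains dense.
            continue
        flat = param.abs().flatten()
        k = max(1, int(keep_fraction * flat.numel()))
        # Smallest magnitude among the top-k entries = keep threshold.
        threshold = torch.topk(flat, k, largest=True).values.min()
        # Keep weights at or above the threshold, zero everything else.
        param.mul_((param.abs() >= threshold).to(param.dtype))
```

In a training loop this projection would typically run after each optimizer step (optionally annealing `keep_fraction` downward), so gradients still flow through the dense tensors between projections.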
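The mean-ablation check (replace every node outside the recovered circuit with its mean activation and verify the task still succeeds) can be sketched with forward hooks. Everything here is hypothetical scaffolding rather than the authors' released code: `layer_names`, `circuit_indices` (kept unit indices per layer), `mean_acts` (per-layer mean activations), and the helper name are all assumed.

```python
import torch


def mean_ablate_outside_circuit(model, layer_names, circuit_indices, mean_acts):
    """Register hooks that replace every unit *not* in the circuit with its mean.

    Assumes each hooked module returns a plain tensor of shape (..., hidden_dim).
    Returns the hook handles; call .remove() on each to undo the ablation.
    """
    handles = []
    modules = dict(model.named_modules())
    for name in layer_names:
        keep = circuit_indices[name]   # indices of circuit nodes to leave intact
        mean = mean_acts[name]         # dataset-mean activation, shape (hidden_dim,)

        def hook(_module, _inputs, output, keep=keep, mean=mean):
            # Start from the mean everywhere, then restore the circuit nodes.
            ablated = mean.expand_as(output).clone()
            ablated[..., keep] = output[..., keep]
            return ablated

        handles.append(modules[name].register_forward_hook(hook))
    return handles
```

Evaluating the task with these hooks installed, and again with the complementary ablation (knocking out the circuit nodes instead), is the kind of sufficiency-and-necessity test the summary describes.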