🤖 AI Summary
The author argues that much of what people take as “interpretable features” learned by sparse autoencoders (SAEs) might instead be a consequence of high-dimensional geometry: if you throw an exponentially large set of random, normalized vectors into an n-dimensional latent space, then with high probability a small subset of them will have large inner products with any fixed activation. That means a trivial “sparse autoencoder” that simply keeps the k random directions with the largest inner products can produce apparently meaningful features without any training. The post cites work showing that SAEs often fail to beat simple baselines, and a paper showing that SAEs can “interpret” randomly initialized transformers nearly as well as trained ones, motivating skepticism about whether SAE training does anything beyond selecting lucky random directions.
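As a concrete illustration of this training-free-SAE idea, here is a minimal sketch (not code from the post; the dimensions `n`, `m`, `k` and the `sparse_code` helper are illustrative choices): it encodes a random unit activation by its top-k inner products with a random ±1 dictionary, decodes with a least-squares fit, and compares against k arbitrarily chosen directions.

```python
# Sketch of a training-free "sparse autoencoder": encode an activation by its
# top-k inner products with random directions, decode by least squares.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 256, 32_768, 32        # latent dim, random dictionary size, active features

# Overcomplete dictionary of random +/-1 directions, normalized to unit length.
D = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)

def sparse_code(x, rows):
    """Least-squares reconstruction of x from the chosen dictionary rows."""
    A = D[rows].T                                   # (n, k) selected directions
    coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
    return A @ coeffs

x = rng.standard_normal(n)
x /= np.linalg.norm(x)

scores = D @ x                                      # inner product with every direction
top_k = np.argsort(-np.abs(scores))[:k]             # best-aligned random directions
rand_k = rng.choice(m, size=k, replace=False)       # baseline: k arbitrary directions

print("error, top-k selection  :", np.linalg.norm(x - sparse_code(x, top_k)))
print("error, random-k baseline:", np.linalg.norm(x - sparse_code(x, rand_k)))
# With zero training, top-k selection reconstructs x markedly better than an
# arbitrary k-subset, purely because so many random directions are available.
```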
Technically, random vectors drawn i.i.d. (e.g., ±1 entries, normalized) form an almost-orthogonal, overcomplete dictionary: most inner products concentrate around zero with standard deviation ~1/√n, but when the number of random vectors m is exponentially large in n, extreme-value effects guarantee that a few will be highly aligned with any given latent direction. A practical sparse encoding is obtained by keeping the top inner products and fitting the reconstruction coefficients (e.g., by least squares). Implications: interpretability claims based on SAEs should control for random-projection baselines; researchers should compare trained SAEs to random initializations, quantify how much training actually improves feature quality, and beware hypothesis multiplicity when many random probes are tested.
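To see the concentration and extreme-value behavior numerically, the sketch below (again with assumed, illustrative values of `n` and `m`) measures the empirical standard deviation of the inner products, which stays near 1/√n, and their maximum over m random directions, which grows roughly like √(2 ln m / n).

```python
# Sketch: inner products of random +/-1 unit vectors with a fixed unit activation
# concentrate around 0 with std ~ 1/sqrt(n); the maximum over m of them grows
# roughly like sqrt(2*ln(m)/n), so very large m yields a few well-aligned directions.
import numpy as np

rng = np.random.default_rng(1)
n = 128
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

for m in (1_000, 10_000, 100_000):
    D = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)   # m random unit directions
    ips = D @ x                                              # inner products with x
    print(f"m={m:>7}: std={ips.std():.4f} (1/sqrt(n)={1/np.sqrt(n):.4f}), "
          f"max|ip|={np.abs(ips).max():.3f} "
          f"(~sqrt(2 ln m / n)={np.sqrt(2*np.log(m)/n):.3f})")
```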