🤖 AI Summary
A hobbyist mechanistic-interpretability write-up shows how a simple, training-free intervention, Activation Addition (ActAdd), can steer an LLM's quirky "purple gradient" preference when generating website HTML. Using Qwen3-8B, the author computed steering vectors from contrastive prompts (e.g., "ALWAYS USE HTML code with green hex colors …" vs. a negative prompt), then added the resulting vector into the residual stream at a chosen layer, scaled by a coefficient. Sweeping layers and strengths produced site outputs biased toward green, yellow, blue, or pink instead of the model's default purple, with the clearest effects when injecting at early layers (2–4) with coefficients around 2–4.
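The core ActAdd arithmetic described above is compact: take the activation difference between a contrastive prompt pair at one layer, then add that difference (times a coefficient) back into the residual stream during generation. Here is a minimal NumPy sketch of just that arithmetic; the array shapes and the `steer` helper are illustrative stand-ins, not the author's actual code, and in the real experiment the activations would come from forward-pass hooks on Qwen3-8B.

```python
import numpy as np

# Hypothetical residual-stream activations (seq_len x d_model) captured at one
# layer for the contrastive prompt pair. In the write-up these come from the
# model itself; random arrays stand in here to show the arithmetic.
rng = np.random.default_rng(0)
act_positive = rng.normal(size=(4, 8))  # e.g. the "green hex colors" prompt
act_negative = rng.normal(size=(4, 8))  # the negative prompt

# ActAdd steering vector: the activation difference between the contrastive
# prompts (averaged over token positions for simplicity).
steering_vector = (act_positive - act_negative).mean(axis=0)

def steer(resid, vector, coeff):
    """Add coeff * vector into every position of the residual stream."""
    return resid + coeff * vector

# During generation, the hook at the chosen layer would apply this addition;
# coefficients around 2-4 at early layers gave the clearest effects.
resid = rng.normal(size=(4, 8))
steered = steer(resid, steering_vector, coeff=3.0)
```

In practice this addition is installed as a forward hook on a single transformer block, so every generated token passes through the shifted residual stream.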
The work reproduces and extends ActAdd-style steering (previously shown on GPT-2) and highlights practical and scientific limits: steering is layer- and strength-dependent, late-layer interventions often fail or degrade color fidelity, Qwen3-4B resisted steering entirely, and hex codes in outputs often match those in the steering prompts, which suggests the effect may be token injection or memorization rather than a true redistribution of latent "color" features. Open questions remain about whether color concepts form linear representations (e.g., green = blue + yellow), why very high strengths revert to purple, and how generalizable such steering is. The experiment is a compact, reproducible case study for research into controllable generation and what activation edits reveal about internal model features.