Representation Engineering (vgel.me)

🤖 AI Summary
A recent deep dive demonstrates how "representation engineering" (calculating and injecting layer-wise control vectors into a model's hidden states) can reliably steer Mistral-7B-Instruct-v0.1 without prompting or finetuning. The pipeline is simple: build contrasting persona prompt pairs, collect last-token hidden states for each side, take their differences, and fit a single-component PCA per layer. Using this recipe, packaged as the repeng library on PyPI, the author trained control vectors in about a minute.

At inference, the per-layer vectors are added to the hidden state before each layer, with a signed scalar coefficient setting direction and strength. Applied to axes like "honest ⇄ dishonest" or "happy ⇄ sad," the vectors yielded dramatic, tunable changes in outputs from the same prompt, sometimes eliciting behaviors that prompt engineering couldn't achieve or that prompt-based steering struggled to scale.

This technique matters because it offers a compact, computationally cheap mechanism to both read and write model behavior, making inner representations directly actionable for interpretability, alignment testing, and fine-grained control. It empowers research (fast experiments, diagnostic axes) but also raises safety concerns: vectors can intensify undesirable behaviors or bypass safety training via targeted activation. Key open technical questions remain: how well vectors generalize across prompts and models, which layers matter most, how multiple vectors interact, which PCA choices are optimal, and how robust results are to coefficient tuning. In short, it is a powerful, low-barrier tool that demands careful study and mitigation.
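To make the pipeline concrete, here is a minimal sketch of the mechanics described above, written against plain Hugging Face transformers and scikit-learn rather than repeng's actual API. The model name comes from the post; the prompt pairs, the coefficients, and the hook placement (adding the vector to each decoder layer's output, i.e. the next layer's input) are illustrative assumptions, and a real run would use far more contrast pairs than the three shown here.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# 1. Contrasting persona prompt pairs (illustrative; use many in practice).
pairs = [
    ("Act as an extremely happy person. Describe your day.",
     "Act as an extremely sad person. Describe your day."),
    ("Act as an extremely happy person. Talk about your job.",
     "Act as an extremely sad person. Talk about your job."),
    ("Act as an extremely happy person. Describe the weather.",
     "Act as an extremely sad person. Describe the weather."),
]

def last_token_states(prompt: str) -> list[np.ndarray]:
    """Per-layer hidden state of the prompt's final token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; [1:] align with decoder layers.
    return [h[0, -1].float().cpu().numpy() for h in out.hidden_states[1:]]

# 2. Collect states for both sides of every pair, then difference per layer.
pos = [last_token_states(p) for p, _ in pairs]
neg = [last_token_states(n) for _, n in pairs]

# 3. Single-component PCA per layer: the dominant direction of the diffs.
control_vectors = []
for layer in range(len(pos[0])):
    diffs = np.stack([p[layer] - n[layer] for p, n in zip(pos, neg)])
    direction = PCA(n_components=1).fit(diffs).components_[0]
    control_vectors.append(
        torch.tensor(direction, dtype=model.dtype, device=model.device)
    )

# 4. At inference, add coefficient * vector at every layer via forward hooks.
def apply_control(coefficient: float):
    """Sign picks the direction, magnitude the strength.

    Returns hook handles; call .remove() on each to restore the base model.
    """
    handles = []
    for block, vec in zip(model.model.layers, control_vectors):
        def hook(module, args, output, vec=vec):
            # Decoder layers usually return a tuple led by the hidden state.
            if isinstance(output, tuple):
                return (output[0] + coefficient * vec,) + output[1:]
            return output + coefficient * vec
        handles.append(block.register_forward_hook(hook))
    return handles

# Steer the same prompt in both directions along the happy ⇄ sad axis.
prompt = "[INST] How was your weekend? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for coeff in (1.5, -1.5):
    handles = apply_control(coeff)
    out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
    print(coeff, tokenizer.decode(out[0], skip_special_tokens=True))
    for h in handles:
        h.remove()
```

Note the design choice in step 4: because the layers run sequentially, adding the vector to one layer's output is equivalent to adding it before the next layer's input, which matches the "add to hidden_state before each layer" description up to the boundary layers.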