🤖 AI Summary
A researcher reported extracting a long internal “soul document” — labeled the Anthropic Guidelines / Model Spec / “Claude’s soul” — from Claude Opus 4.5 by driving the model with specially crafted seed prompts and a consensus-driven, low-variance sampling strategy. Using the Claude Console and API, they fed ~1.5k–4k tokens of prefill, set temperature=0 and top_k=1, ran many parallel instances (a “council” of 10–20 copies), and accepted only lines with majority agreement, iteratively expanding the prompt one accepted line at a time. With prompt caching and greedy decoding they produced consistent ~10k-token outputs, normalized whitespace, and estimated ~95% fidelity relative to the text compressed into the weights. The extracted text reads like Anthropic’s internal character-training and values spec — mission statements, safety priorities, and specific jargon — and the behavior was reliably reproducible only on Opus 4.5, not on Sonnet or earlier Claude variants.
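The extend-by-consensus loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the researcher’s actual harness: `sample_fn` is a hypothetical stand-in for a greedy (temperature=0, top_k=1) API call that returns the model’s next line given the current prefill, and the council size and quorum threshold are assumptions.

```python
from collections import Counter

def consensus_line(candidates, quorum=0.5):
    """Return the most common candidate line only if a strict
    majority of council members produced it; otherwise None."""
    counts = Counter(c.strip() for c in candidates)
    line, votes = counts.most_common(1)[0]
    return line if votes / len(candidates) > quorum else None

def extend_prompt(prefix, sample_fn, council_size=15, max_lines=50):
    """Iteratively grow the prefill one consensus line at a time.
    `sample_fn(prefix)` is a hypothetical wrapper around a greedy
    model call; with real APIs, prompt caching keeps the shared
    prefix cheap across the council's parallel calls."""
    for _ in range(max_lines):
        candidates = [sample_fn(prefix) for _ in range(council_size)]
        line = consensus_line(candidates)
        if not line:  # no majority, or model emitted an empty line
            break
        prefix = prefix + line + "\n"
    return prefix
```

Because greedy decoding is near-deterministic, disagreement among council members signals the model is no longer confidently reproducing memorized text, which is where the loop stops rather than accepting a hallucinated continuation.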
This is significant because it suggests nontrivial internal guidance can be recoverable from model behavior, not just from exposed system messages or public data. The techniques used (deterministic sampling, consensus voting, prompt caching) are effectively a form of model extraction tailored to recover memorized or compressed training artifacts. Practical implications include proprietary- and safety-leakage risks, questions about what counts as memorized versus injected context, and a potential need for engineering defenses (rate limits, hidden-context obfuscation, watermarking, or training-time mitigations). It’s also a useful case study for interpretability: targeted probing plus consensus decoding can reveal latent instruction-like content and structural knowledge encoded in the weights.