Heretic: Automatic censorship removal for language models (github.com)

🤖 AI Summary
Heretic is a new command-line tool that automatically strips "safety" refusals from transformer language models without additional fine-tuning. It couples a parametrized implementation of directional ablation (aka "abliteration") with Optuna's TPE optimizer to co-minimize the number of refusals on harmful prompts and the KL divergence from the original model on harmless prompts. In benchmarks on google/gemma-3-12b-it, Heretic matched the refusal suppression of human-crafted abliterations (3/100 refusals) while achieving substantially lower KL divergence (0.16 vs. 0.45–1.04), indicating less capability loss. The tool is fully automated and easy to run (pip install heretic-llm; heretic <model>), and it can decensor many dense, multimodal, and some MoE models without requiring deep transformer knowledge.

Technically, Heretic computes per-layer "refusal directions" as the difference of means of first-token residuals between harmful and harmless prompts, then orthogonalizes selected transformer matrices (the attention out-projection and MLP down-projection) against those directions so those components can no longer write along them (see the sketch below). Key innovations include a highly flexible per-component ablation weight kernel (parameterized by max weight, min weight, position, and distance), a float direction_index that linearly interpolates between per-layer direction vectors, and separate parameterization for attention and MLP components (MLP interventions tend to be more damaging).

Optuna's TPE sampler searches this parameter space automatically (a sketch of the search loop also appears below), and a hardware-benchmarking step picks batch sizes for the local GPU; a full run on Llama-3.1-8B takes roughly 45 minutes on an RTX 3090. Heretic is AGPL-licensed. While useful for alignment research and capability analysis, it also lowers the bar for decensoring models and raises clear safety and misuse concerns.
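To make the ablation mechanism concrete, here is a minimal sketch of the difference-of-means refusal direction, the float direction_index interpolation, and the rank-1 orthogonalization of a weight matrix that writes into the residual stream. The function names, tensor shapes, and the scale parameter are illustrative assumptions, not Heretic's actual API.

```python
# Sketch of directional ablation ("abliteration") as described in the summary.
import torch


def refusal_direction(harmful_resid: torch.Tensor,
                      harmless_resid: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction for one layer.

    Both inputs hold first-token residual activations with shape
    (num_prompts, hidden_dim); the result is a unit vector in
    residual-stream space.
    """
    direction = harmful_resid.mean(dim=0) - harmless_resid.mean(dim=0)
    return direction / direction.norm()


def interpolated_direction(per_layer_dirs: torch.Tensor,
                           direction_index: float) -> torch.Tensor:
    """Linearly interpolate between per-layer directions for a float index,
    mirroring the summary's description of direction_index.

    `per_layer_dirs` has shape (num_layers, hidden_dim).
    """
    lo = int(direction_index)
    hi = min(lo + 1, per_layer_dirs.shape[0] - 1)
    frac = direction_index - lo
    mixed = (1.0 - frac) * per_layer_dirs[lo] + frac * per_layer_dirs[hi]
    return mixed / mixed.norm()


def ablate_direction(weight: torch.Tensor,
                     direction: torch.Tensor,
                     scale: float = 1.0) -> torch.Tensor:
    """Orthogonalize a weight matrix against the refusal direction.

    `weight` has shape (hidden_dim, in_dim), as in a Hugging Face Linear
    weight for an attention out-projection or MLP down-projection.
    Subtracting the rank-1 component along `direction` stops the layer
    from writing along it; `scale` stands in for Heretic's per-component
    ablation weight.
    """
    projector = torch.outer(direction, direction)  # (hidden_dim, hidden_dim)
    return weight - scale * (projector @ weight)
```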
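The search loop can be sketched in the same spirit: Optuna's TPE sampler proposes ablation parameters, and each trial is scored by the refusal count plus a KL-divergence penalty. The parameter names, ranges, stub evaluator, and scalarization below are assumptions for illustration; Heretic's real search space and objective may differ.

```python
# Hypothetical sketch of the Optuna-driven parameter search.
import optuna


def evaluate_ablation(params: dict) -> tuple[int, float]:
    """Stand-in for the real evaluation: apply the ablation described by
    `params`, count refusals on harmful prompts, and compute the KL
    divergence from the original model on harmless prompts."""
    # Dummy values so the sketch runs; Heretic measures these on real prompts.
    return 3, 0.16


def objective(trial: optuna.Trial) -> float:
    params = {
        # Attention and MLP are parameterized separately (assumed ranges).
        "attn_max_weight": trial.suggest_float("attn_max_weight", 0.0, 1.5),
        "mlp_max_weight": trial.suggest_float("mlp_max_weight", 0.0, 1.5),
        # Float index interpolating between per-layer refusal directions.
        "direction_index": trial.suggest_float("direction_index", 0.0, 1.0),
    }
    refusals, kl = evaluate_ablation(params)
    # One possible scalarization of the two co-minimized objectives.
    return refusals + 10.0 * kl


study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
print(study.best_params)
```

TPE is a reasonable fit here because each trial requires re-evaluating the model on prompt sets, so the budget of trials is small and the search space mixes several continuous parameters.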