Researchers find a way to address the problem of AI forgetting how to behave safely (www.techradar.com)

🤖 AI Summary
Researchers at UC Riverside showed how to prevent open-source AI models from “forgetting” safety behavior when they’re trimmed to run on smaller devices. They found that moving a model’s exit layer earlier (a common technique to speed inference by skipping later layers) can disable internal guardrails, because those skipped layers often encode the ability to detect and block unsafe prompts. Instead of adding external filters, the team retrained the model’s internal representations so the slimmed-down version retains the ability to refuse harmful requests. In experiments with the vision-language model LLaVA 1.5, an early-exit version initially produced dangerous outputs (including detailed bomb-making instructions), but after retraining it consistently refused unsafe prompts.

This work, presented at ICML, matters to the AI/ML community because it addresses a practical tension between model efficiency and safety for edge deployment. By hardening the model’s internal behavior rather than bolting on post-processing filters, the approach supports safer deployment of compressed or early-exit models on phones, cars, and other low-power hardware, which is especially important for open-source models that are frequently modified. The method doesn’t eliminate further risks, but it offers a concrete, “benevolent hacking” technique for preserving safety during model compression and could be integrated into responsible model-release and deployment workflows.
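To make the mechanism concrete, here is a minimal, hypothetical sketch of the two ideas in the summary: an early exit that skips later transformer layers at inference time, and fine-tuning the model *at that early exit* on refusal targets so the retained layers carry the safety behavior themselves. This is not the UC Riverside implementation or LLaVA 1.5's architecture; the `EarlyExitLM` class, `finetune_early_exit` helper, layer counts, and the toy dataset are all placeholders for illustration.

```python
# Hypothetical sketch only: early-exit inference + safety-aware fine-tuning.
# Model size, layer count, and the toy "refusal" data are illustrative placeholders.
import torch
import torch.nn as nn


class EarlyExitLM(nn.Module):
    """Toy transformer LM with a shared output head readable from any layer."""

    def __init__(self, vocab_size=1000, d_model=128, n_layers=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(d_model, vocab_size)  # shared exit head

    def forward(self, tokens, exit_layer=None):
        # exit_layer < n_layers skips the later blocks to cut inference cost;
        # if refusal behavior lives only in those skipped blocks, it is lost.
        h = self.embed(tokens)
        n = exit_layer if exit_layer is not None else len(self.layers)
        for layer in self.layers[:n]:
            h = layer(h)
        return self.head(h)


def finetune_early_exit(model, batches, exit_layer, lr=1e-4):
    """Optimize the *early* exit on refusal targets, so the retained layers
    themselves encode "refuse unsafe prompts" instead of relying on the
    skipped later layers or an external output filter."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for prompts, refusal_targets in batches:  # (B, T) token tensors
        logits = model(prompts, exit_layer=exit_layer)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       refusal_targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = EarlyExitLM()
    # Dummy tensors standing in for (unsafe prompt, refusal response) pairs.
    batches = [(torch.randint(0, 1000, (4, 16)), torch.randint(0, 1000, (4, 16)))
               for _ in range(3)]
    finetune_early_exit(model, batches, exit_layer=4)
```

The design point the sketch is meant to show: the safety objective is applied at the same depth the deployed model will actually exit from, rather than only at the final layer, so compressing the model no longer strips the behavior away.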