AMS – Detect unsafe LLMs in 30 seconds via activation analysis (github.com)

0 points 2 hours ago ago | visit original

🤖 AI Summary

The Activation-based Model Scanner (AMS) has been introduced, a tool designed to verify the safety of language models by analyzing their activation patterns. Using a methodology known as Activation Fingerprinting, AMS detects if a model retains its safety training by measuring the separation of safety-relevant concepts within its activation space. The significance of this development lies in its ability to quickly identify models that have undergone unsafe modifications, such as "uncensored" fine-tunes or abliterated models, thus helping practitioners ensure compliance with safety standards before deployment. AMS operates on a GPU and can conduct scans in 10 to 40 seconds, with specific commands that facilitate both standard and custom scanning modes. It categorizes models based on their safety training status into PASS, WARNING, and CRITICAL tiers, providing developers with crucial insights regarding harmful content processing, injection resistance, and refusal capability. Built to be accessible through PyPI or direct GitHub installation, AMS aims to enhance the safety of AI/ML deployments by offering a rapid and comprehensive safety check that addresses the growing concerns around the irresponsible use of language models.

Loading comments...

loading comments...