🤖 AI Summary
Anthropic has launched a new invite-only bug bounty program in partnership with HackerOne to stress-test its latest safety defenses, focusing on universal jailbreaks of its updated Constitutional Classifiers system. These classifiers, designed to enforce strict principles around sensitive content, particularly chemical, biological, radiological, and nuclear (CBRN) threats, are deployed as safeguards on the Claude 3.7 Sonnet model. The program offers rewards of up to $25,000 for jailbreaks that consistently bypass the safety measures across a range of topics, with the goal of meeting the AI Safety Level-3 (ASL-3) Deployment Standard outlined in Anthropic's Responsible Scaling Policy.
This initiative matters for the AI/ML community because it addresses the need for robust, scalable safety measures in increasingly capable language models. By inviting experienced security researchers and red teamers to surface exploitation risks before public deployment, Anthropic is advancing best practices for AI safety and responsible model scaling. The program builds on a similar effort from last year, and participants will transition to a new phase testing the upcoming Claude Opus 4 model, reflecting an ongoing commitment to refining and hardening AI safeguards against misuse, especially in high-risk domains such as biological threats.
Anthropic's engagement with the security research community underscores the growing importance of proactive efforts to identify and patch vulnerabilities before they can lead to harmful outcomes. Interested researchers can apply for an invitation; the program emphasizes structured feedback and iterative improvement against stringent safety protocols, contributing to safer, more responsible AI deployment.