The rapid advancement of artificial intelligence (AI) model capabilities necessitates equally swift progress in safety protocols. According to Anthropic, the company is expanding its bug bounty program with a new initiative aimed at finding flaws in the mitigations designed to prevent misuse of its models.
Bug bounty programs are essential in fortifying the security and safety of technological systems. Anthropic’s new initiative focuses on identifying and mitigating universal jailbreak attacks: exploits that consistently bypass AI safety guardrails across a wide range of topics. The initiative targets high-risk domains such as chemical, biological, radiological, and nuclear (CBRN) safety, as well as cybersecurity.
Our Approach
To date, Anthropic has operated an invite-only bug bounty program in collaboration with HackerOne, rewarding researchers for identifying model safety issues in publicly released AI models. The newly announced bug bounty initiative aims to test Anthropic’s next-generation AI safety mitigation system, which has not yet been publicly deployed. Key features of the program include:
- Early Access: Participants will receive early access to test the latest safety mitigation system before its public deployment. They will be challenged to identify potential vulnerabilities or ways to circumvent safety measures in a controlled environment.
- Program Scope: Anthropic offers bounty rewards of up to $15,000 for novel, universal jailbreak attacks that could expose vulnerabilities in critical, high-risk domains such as CBRN and cybersecurity. A universal jailbreak is a type of vulnerability allowing consistent bypassing of AI safety measures across a wide range of topics. Detailed instructions and feedback will be provided to program participants.
Get Involved
This model safety bug bounty initiative will initially be invite-only, conducted in partnership with HackerOne. While starting as invite-only, Anthropic plans to broaden the initiative in the future. This initial phase aims to refine processes and provide timely, constructive feedback on submissions. Experienced AI security researchers, and those with expertise in identifying jailbreaks in language models, are encouraged to apply for an invitation through the application form by Friday, August 16. Selected applicants will be contacted in the fall.
In the meantime, Anthropic actively seeks reports on model safety concerns to improve its current systems. Potential safety issues can be reported to usersafety@anthropic.com with enough detail for the team to replicate the issue. More information can be found in the company’s Responsible Disclosure Policy.
This initiative aligns with commitments Anthropic has signed with other AI companies for responsible AI development, such as the Voluntary AI Commitments announced by the White House and the Code of Conduct for Organizations Developing Advanced AI Systems developed through the G7 Hiroshima Process. The goal is to accelerate progress in mitigating universal jailbreaks and strengthen AI safety in high-risk areas. Experts in this field are encouraged to join this crucial effort to ensure that as AI capabilities advance, safety measures keep pace.