Tue Feb 04 2025

Anthropic introduces a new security system that it claims can stop almost all AI jailbreaks.

A new security measure incorporates values into large language models.

Anthropic has introduced a new safety measure known as "constitutional classifiers," tested as a proof of concept on its Claude 3.5 Sonnet language model. The approach incorporates a set of human values into the system by screening prompts and responses against a written "constitution" of rules, in an effort to curb misuse of these models through harmful prompts.
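
In broad strokes, that means gating both the prompt going into the model and the response coming out. The sketch below illustrates only this gating pattern; the keyword list, threshold, function names, and model stub are hypothetical placeholders, not Anthropic's classifiers or API.

```python
# Minimal sketch of the classifier-gating pattern described above.
# Everything here is a hypothetical illustration: the keyword list,
# the threshold, and the model stub stand in for trained classifiers
# and the underlying language model.

BLOCKED_TOPICS = ["nerve agent synthesis", "enrich uranium"]  # toy "constitution"
REFUSAL = "I can't help with that request."

def input_score(prompt: str) -> float:
    """Toy input classifier: flags prompts that mention disallowed topics."""
    return 1.0 if any(topic in prompt.lower() for topic in BLOCKED_TOPICS) else 0.0

def output_score(text: str) -> float:
    """Toy output classifier: re-checks the generated text the same way."""
    return 1.0 if any(topic in text.lower() for topic in BLOCKED_TOPICS) else 0.0

def model(prompt: str) -> str:
    """Stand-in for the underlying language model."""
    return f"(model response to: {prompt})"

def guarded_generate(prompt: str, threshold: float = 0.5) -> str:
    # Screen the prompt before it ever reaches the model.
    if input_score(prompt) >= threshold:
        return REFUSAL
    # Generate, then screen the response before returning it. A streaming
    # deployment would score the output continuously and halt generation
    # as soon as the score crosses the threshold.
    response = model(prompt)
    if output_score(response) >= threshold:
        return REFUSAL
    return response

if __name__ == "__main__":
    print(guarded_generate("Summarize the history of vaccination."))
    print(guarded_generate("Explain how to enrich uranium at home."))
```

In the system Anthropic describes, the input and output classifiers are themselves models trained on synthetic data generated from the constitution, rather than keyword filters like the toy versions above.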

The company's safety research team reported that, with constitutional classifiers in place, the rate of successful jailbreaks against Claude 3.5 Sonnet fell by 81.6 percentage points. The safeguard also had a minimal impact in normal use, with an absolute increase of 0.38% in refusals on production traffic and a 23.7% increase in inference overhead.
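
To make those figures concrete, the short calculation below uses the automated-evaluation success rates Anthropic reported alongside them (roughly 86% without classifiers and 4.4% with them); treat the exact values as approximate.

```python
# Illustrative arithmetic behind the reported figures. The 86% and 4.4%
# success rates are the automated-evaluation numbers Anthropic reported;
# treat them as approximate.

baseline_success = 0.86   # jailbreak success rate without classifiers
guarded_success = 0.044   # jailbreak success rate with classifiers

absolute_drop = baseline_success - guarded_success   # 0.816 -> the "81.6" figure
relative_drop = absolute_drop / baseline_success     # ~0.95, i.e. ~95% of attacks blocked

print(f"Absolute drop: {absolute_drop:.1%}")   # 81.6%
print(f"Relative drop: {relative_drop:.1%}")   # ~94.9%

# The 0.38% refusal figure is likewise an absolute increase: a hypothetical
# baseline refusal rate of 1.00% would rise to 1.38% of production traffic.
```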

Although language models can generate many kinds of risky content, Anthropic, like others in the industry, is focused on risks related to chemical, biological, radiological, and nuclear (CBRN) content. To demonstrate the effectiveness of constitutional classifiers, the company launched a public demo that challenges users to defeat eight levels of CBRN-related jailbreak tests. The move has drawn criticism from some who see it as shifting security testing onto the community without any form of reward.

Anthropic explained that the jailbreaks that did get past its defenses worked around the classifiers rather than disabling them. It cited techniques such as benign paraphrasing, in which a harmful request is reworded in innocuous-sounding language to describe, for example, the extraction of toxins, and length exploitation, which buries the request in unnecessary detail to confuse the model.

Despite the success in reducing jailbreaks, the company acknowledged that the version of the classifiers used in testing exhibited "impractically high" refusal rates, and it admitted the possibility of false positives and false negatives in its rubric-based evaluation system.

Meanwhile, a new language model, DeepSeek R1, has emerged from China. The open-source model, which can run on relatively modest hardware, has faced similar challenges from jailbreak techniques such as "God mode," used to bypass guardrails around the discussion of controversial topics in Chinese history and politics.