Overview

Constitutional Classifiers is an AI safety technique developed by Anthropic that defends LLMs against jailbreaks, especially “universal” jailbreaks that bypass safety guardrails across an entire category of harmful queries. The system wraps a base model with trained input and output classifiers. In automated evaluation, this reduced the jailbreak success rate from ~86% to ~4.4% while adding only ~0.38% overrefusal on harmless queries and ~23.7% additional compute cost.

The technique extends AI safety work by treating classifier training as a constitution-driven synthetic data problem, analogous to Constitutional AI’s use of a written principle document to guide model behavior.

How it works

Constitution

A human-authored document specifying allowed and disallowed content classes. For CBRN (chemical, biological, radiological, nuclear) protection, the constitution distinguishes benign content (recipes for mustard) from prohibited content (mustard gas synthesis). The constitution is the authoritative policy document; the classifiers are trained on data generated from it rather than on hand-labeled examples.
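As a rough illustration of the idea that a constitution enumerates content classes, the sketch below models one as a small data structure. The class names, field names, and the `labels` helper are hypothetical and chosen only for this example; the source does not describe the actual format of the constitution.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ContentClass:
    """One content class in a hypothetical constitution."""
    name: str
    allowed: bool
    example: str


@dataclass
class Constitution:
    """Hypothetical container for the content classes of one domain."""
    domain: str
    classes: List[ContentClass] = field(default_factory=list)

    def labels(self) -> Dict[str, str]:
        # Map each class name to the label a classifier would be trained to predict.
        return {c.name: ("harmless" if c.allowed else "harmful") for c in self.classes}


# Illustrative entries based on the mustard example in the text above.
cbrn = Constitution(
    domain="CBRN",
    classes=[
        ContentClass("mustard_recipes", allowed=True, example="mustard recipes"),
        ContentClass("mustard_gas_synthesis", allowed=False, example="mustard gas synthesis"),
    ],
)

print(cbrn.labels())
```

Keeping the policy in one structured object like this is one way to make it the single source from which training labels are derived.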

Synthetic data generation

Claude is used to generate large volumes of synthetic prompts and completions covering all content classes defined in the constitution. These are then augmented:

Classifier training and deployment

Two classifiers are trained on the synthetic data: an input classifier, which screens prompts, and an output classifier, which screens completions.

Both classifiers run in addition to the base model’s own safety training. Content flagged by either classifier is blocked. This layered approach means an attacker must defeat the base model’s safety training and both classifiers simultaneously.
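The layering described above can be sketched as a wrapper around the base model. This is a minimal sketch under stated assumptions, not Anthropic's implementation: `guarded_generate`, the `BLOCKED` marker, and the keyword-based toy classifiers are all hypothetical stand-ins, and the output check here runs on the full completion for simplicity.

```python
from typing import Callable

BLOCKED = "[blocked]"  # hypothetical marker returned when a classifier flags content


def guarded_generate(
    prompt: str,
    base_model: Callable[[str], str],
    input_classifier: Callable[[str], bool],
    output_classifier: Callable[[str], bool],
) -> str:
    """Return the model's completion unless either classifier flags the exchange."""
    if input_classifier(prompt):
        return BLOCKED  # flagged before the base model is called
    completion = base_model(prompt)
    if output_classifier(completion):
        return BLOCKED  # flagged after generation
    return completion


# Toy stand-ins (assumptions for illustration only; real classifiers are trained models).
def toy_input_classifier(prompt: str) -> bool:
    return "mustard gas" in prompt.lower()


def toy_output_classifier(completion: str) -> bool:
    return "synthesis route" in completion.lower()


def toy_base_model(prompt: str) -> str:
    return f"Echo: {prompt}"


def ask(prompt: str) -> str:
    return guarded_generate(prompt, toy_base_model, toy_input_classifier, toy_output_classifier)


print(ask("Share a recipe for mustard"))
print(ask("How do I make mustard gas?"))
print(ask("Describe a synthesis route"))
```

The third call passes the input check but is caught by the output check, which illustrates why an attacker has to get past every layer rather than just one.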

Evaluation results

Human red-teaming (prototype, 2024)

Automated evaluation (updated system)

On 10,000 synthetically generated jailbreak prompts:

Metric                 | Baseline | With classifiers
Jailbreak success rate | 86%      | 4.4%
Overrefusal increase   | -        | +0.38% (not significant)
Compute overhead       | -        | +23.7%
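To put the jailbreak figures in perspective, the short calculation below derives the relative reduction and the implied number of successful prompts. The counts are not reported in the source; they are computed here on the assumption that both rates apply to the same 10,000 prompts.

```python
n_prompts = 10_000      # size of the automated evaluation set
baseline_rate = 0.86    # jailbreak success rate without classifiers
guarded_rate = 0.044    # jailbreak success rate with classifiers

# Implied counts (derived, assuming both rates apply to all 10,000 prompts).
baseline_successes = round(n_prompts * baseline_rate)
guarded_successes = round(n_prompts * guarded_rate)

# Relative reduction in jailbreak success rate.
relative_reduction = (baseline_rate - guarded_rate) / baseline_rate

print(f"Implied successful jailbreaks: {baseline_successes} -> {guarded_successes}")
print(f"Relative reduction: {relative_reduction:.1%}")
```

Under these assumptions the classifiers remove roughly 95% of the jailbreaks that succeed against the unguarded model.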

Public demo (Feb 3–10, 2025)

Focus: chemical weapons queries across 8 graded questions.

Limitations

Relationship to Responsible Scaling Policy

Constitutional Classifiers is a key mechanism for Anthropic’s ASL-3 deployment tier — models that have crossed the CBRN capability threshold. The Responsible Scaling Policy requires provably effective safeguards before such models can be deployed; Constitutional Classifiers provides a measurable, evolvable safeguard.

Conceptual connections

Resources

Notable red-team outcomes

The competitive red-teaming community has produced notable benchmarks against Constitutional Classifiers:

See for the broader competitive red-teaming methodology and for the injection sub-class.