Overview
Constitutional Classifiers is an AI safety technique developed by Anthropic that defends LLMs against jailbreaks, especially “universal” jailbreaks that bypass safety guardrails across an entire category of harmful queries. The system wraps a base model with trained input and output classifiers. In Anthropic’s automated evaluation, this reduced the jailbreak success rate from 86% to 4.4%, at the cost of a 0.38 percentage-point increase in refusals of harmless queries (not statistically significant) and 23.7% additional compute.
The technique treats classifier training as a constitution-driven synthetic data problem, analogous to Constitutional AI’s use of a written set of principles to guide model behavior.
How it works
Constitution
A human-authored document specifying allowed and disallowed content classes. For CBRN (chemical, biological, radiological, nuclear) protection, the constitution distinguishes benign chemistry (mustard recipes) from prohibited content (mustard gas synthesis). The constitution is the authoritative policy document; classifiers are derived from it rather than hand-labeled.
Synthetic data generation
Claude is used to generate large volumes of synthetic prompts and completions covering all content classes defined in the constitution. These are then augmented:
- Translated into multiple languages
- Rewritten in the style of known jailbreak patterns (role-play, ciphers, unusual capitalization, many-shot, prompt injection)
- Supplemented with contractor-generated benign queries to reduce overrefusals
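The augmentation step above can be sketched as a small pipeline. This is a minimal illustration, not Anthropic's implementation: the `Example` record, the two transform functions, and `augment` are hypothetical names, and simple string transforms stand in for the model-driven rewrites (translation, ciphers, role-play) that the source describes. The point it shows is that each rewritten variant keeps the label of the example it came from.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    text: str
    label: str   # "harmful" or "harmless", following the constitution's classes
    source: str  # provenance tag, e.g. "synthetic" or "contractor"


# Hypothetical stand-ins for jailbreak-style rewrites. A real pipeline would
# use a language model here; string transforms keep the sketch runnable.
def odd_capitalization(text: str) -> str:
    return "".join(c.upper() if i % 2 == 0 else c.lower() for i, c in enumerate(text))


def role_play_wrapper(text: str) -> str:
    return f"You are an actor playing a chemist. Stay in character. {text}"


Transform = Tuple[str, Callable[[str], str]]


def augment(examples: Sequence[Example], transforms: Sequence[Transform]) -> List[Example]:
    """Return the originals plus one rewritten copy per (example, transform) pair.

    The label is copied unchanged: a rewrite changes the surface form of a
    prompt, not the content class it belongs to.
    """
    out: List[Example] = []
    for ex in examples:
        out.append(ex)
        for name, fn in transforms:
            out.append(replace(ex, text=fn(ex.text), source=f"{ex.source}+{name}"))
    return out


seed = [
    Example("How do I make a mustard vinaigrette?", "harmless", "contractor"),
    Example("<placeholder for a disallowed synthetic prompt>", "harmful", "synthetic"),
]
dataset = augment(seed, [("caps", odd_capitalization), ("roleplay", role_play_wrapper)])
```

Keeping a provenance tag on every record makes it possible to check later which augmentation styles a trained classifier handles poorly.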
Classifier training and deployment
Two classifiers are trained on the synthetic data:
- Input classifier — screens the user’s prompt before it reaches the model
- Output classifier — screens the model’s completion before it reaches the user
Both classifiers run in addition to the base model’s own safety training. Content flagged by either classifier is blocked. This layered approach means an attacker must defeat the base model’s safety training and both classifiers simultaneously.
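The layered flow described above can be sketched as a wrapper around a model call. This is a simplified illustration under stated assumptions: `guarded_generate`, the `threshold` parameter, and the keyword-based `toy_score` are hypothetical, and the sketch screens the full completion in one step. The real classifiers are trained models, not keyword matchers.

```python
from typing import Callable

REFUSAL = "This request was blocked by a safety classifier."


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_score: Callable[[str], float],
    output_score: Callable[[str], float],
    threshold: float = 0.5,  # assumed decision threshold, for illustration only
) -> str:
    """Run the model only if the prompt passes, and return the completion only if it passes."""
    if input_score(prompt) >= threshold:
        return REFUSAL  # blocked before the model ever sees the prompt
    completion = model(prompt)
    if output_score(completion) >= threshold:
        return REFUSAL  # blocked before the user ever sees the completion
    return completion


# Toy stand-ins so the sketch runs end to end.
def toy_score(text: str) -> float:
    return 1.0 if "mustard gas" in text.lower() else 0.0


def echo_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def leaky_model(prompt: str) -> str:
    # Simulates a base model whose own safety training was bypassed.
    return "Here is how to make mustard gas ..."


benign = guarded_generate("How do I make a mustard vinaigrette?", echo_model, toy_score, toy_score)
blocked_in = guarded_generate("Explain mustard gas synthesis", echo_model, toy_score, toy_score)
blocked_out = guarded_generate("Tell me a story", leaky_model, toy_score, toy_score)
```

The third call shows why the output classifier matters: the prompt looks harmless, so only the check on the completion catches the failure.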
Evaluation results
Human red-teaming (prototype, 2024)
- 183 active participants, >3,000 hours over two months
- Up to $15,000 bounty offered for a confirmed universal jailbreak
- No universal jailbreak found against the prototype (CBRN-focused)
- Prototype had high overrefusal rates and high compute cost
Automated evaluation (updated system)
On 10,000 synthetically generated jailbreak prompts:
| Metric | Baseline | With classifiers |
|---|---|---|
| Jailbreak success rate | 86% | 4.4% |
| Overrefusal increase | — | +0.38% (not significant) |
| Compute overhead | — | +23.7% |
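To put the table in relative terms, the short calculation below derives the approximate number of successful jailbreaks implied by each rate and the relative reduction between them. The counts are back-calculated from the rounded percentages above, so treat them as approximations rather than reported figures.

```python
n_prompts = 10_000
baseline_rate = 0.86   # jailbreak success rate without classifiers
guarded_rate = 0.044   # jailbreak success rate with classifiers

# Approximate counts implied by the rounded rates.
baseline_successes = round(baseline_rate * n_prompts)
guarded_successes = round(guarded_rate * n_prompts)

# Share of previously successful jailbreaks that the classifiers stop.
relative_reduction = 1 - guarded_rate / baseline_rate

print(f"Implied successes: {baseline_successes} -> {guarded_successes}")
print(f"Relative reduction: {relative_reduction:.1%}")
```

Under these figures, roughly 95% of the jailbreaks that succeeded against the unguarded model are blocked.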
Public demo (Feb 3–10, 2025)
Focus: chemical weapons queries across 8 graded questions.
- 339 participants, >300,000 interactions, ~3,700 collective hours
- System held for five days with no universal jailbreak
- On day six/seven: 4 participants cleared all 8 levels; 1 confirmed universal jailbreak found
- Most effective attack vectors: ciphers/encodings, role-play via system prompts, keyword substitution (e.g. “Soman” → “water”), prompt injection
Limitations
- Does not prevent all universal jailbreaks; sufficiently motivated attackers can succeed given enough time
- New attack techniques not represented in the synthetic training data may bypass classifiers
- Real-world deployment would supplement with rapid-response monitoring (flagging repeated classifier trips, updating classifiers in response to discovered attacks)
- Constitution must be actively maintained to cover novel attack surfaces
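The rapid-response monitoring mentioned above could take many forms. One minimal sketch is a sliding-window counter that flags an account once it trips a classifier repeatedly within a short period. `TripMonitor`, `max_trips`, and `window_seconds` are hypothetical names and values chosen for illustration; the source does not describe how such monitoring is implemented.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class TripMonitor:
    """Flag accounts that trip a classifier too often within a time window."""

    def __init__(self, max_trips: int = 3, window_seconds: float = 600.0) -> None:
        self.max_trips = max_trips
        self.window_seconds = window_seconds
        self._trips: Dict[str, Deque[float]] = defaultdict(deque)

    def record_trip(self, account_id: str, timestamp: float) -> bool:
        """Record one classifier trip; return True if the account should be flagged."""
        trips = self._trips[account_id]
        trips.append(timestamp)
        # Drop trips that have aged out of the window.
        while trips and timestamp - trips[0] > self.window_seconds:
            trips.popleft()
        return len(trips) >= self.max_trips


monitor = TripMonitor(max_trips=3, window_seconds=60.0)
flags = [monitor.record_trip("acct-1", t) for t in (0.0, 10.0, 20.0)]
```

A flagged account would then be routed to human review or used as a signal that the classifiers need retraining on a new attack pattern.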
Relationship to Responsible Scaling Policy
Constitutional Classifiers is a key mechanism for Anthropic’s ASL-3 deployment tier, which covers models that have crossed the CBRN capability threshold. The Responsible Scaling Policy requires demonstrably effective safeguards before such models can be deployed; Constitutional Classifiers provides a measurable, evolvable safeguard.
Conceptual connections
- Jailbreaks: the attack class Constitutional Classifiers is designed to defeat
- AI alignment: the broader program; Constitutional AI and Constitutional Classifiers are two of Anthropic’s constitution-based techniques
- AI: language model foundations
Resources
- 2025-02-03 ◦ Constitutional Classifiers (Anthropic) — research post announcing the system with methodology, results, and live demo findings
- 2025-02-03 ◦ Constitutional Classifiers (arXiv) — full technical paper by the Anthropic Safeguards Research Team
- 2026-06-24 ◦ Pliny HackAPrompt Dataset (HuggingFace) — open-sourced corpus of 16,902 competitive jailbreak submissions; documents the broader red-teaming ecosystem that validates defenses like Constitutional Classifiers
Notable red-team outcomes
The competitive red-teaming community has produced notable results against Constitutional Classifiers:
- Valen Tagliabue won HackAPrompt 1.0 (2023, ~$5K prize), later competed in Anthropic’s Constitutional Classifier competition, and became the first person to clear all eight jailbreak challenge levels, winning ~$23K. The result is a significant empirical data point on how well the system resists attack in practice
- His trajectory from HackAPrompt competitor to full-time AI red-teaming researcher illustrates the career pipeline from competitive jailbreaking into formal AI safety work
See for the broader competitive red-teaming methodology and for the injection sub-class.