Overview

Red teaming LLMs is the practice of systematically attempting to elicit harmful, unsafe, or policy-violating outputs from a language model in order to discover vulnerabilities before or during deployment. Borrowed from military and cybersecurity traditions, LLM red-teaming ranges from informal adversarial probing by individual researchers to structured competitive events with formal scoring and monetary prizes.

Unlike automated benchmarks, human red-teamers can discover qualitatively novel attack strategies that no benchmark anticipated. This makes red-teaming a critical complement to automated safety evaluations in alignment research.

Forms of red teaming

Internal (closed) red teaming

AI labs employ dedicated red teams who attempt to break the model before release. Anthropic’s prototype Constitutional Classifiers evaluation used 183 participants over >3,000 hours with up to $15,000 bounty for a confirmed universal jailbreak. No universal jailbreak was found in that round.

Public demo / bug bounty

A live model is exposed to the public for a fixed window. Anthropic’s Constitutional Classifiers public demo (Feb 3–10, 2025) attracted 339 participants who collectively spent ~3,700 hours on 8 graded chemical-weapons jailbreak challenges. The system held for five days; on days six and seven, 4 participants cleared all 8 levels (1 confirmed universal jailbreak). See Constitutional classifiers for full evaluation detail.

Competitive platforms

Open competitions where participants submit jailbreak attempts against a target model for prizes and public recognition. The LLM jailbreaking research community has developed a rich competitive ecosystem:

Indirect / agentic red teaming

Testing prompt injection attacks in agentic contexts — where an LLM takes actions on behalf of users. Attackers attempt to hijack the agent’s instructions via content it processes (emails, documents, web pages). See Prompt injection for the underlying vulnerability class.

Automated vulnerability scanning

LLM vulnerability scanning tools automate adversarial probing at scale — analogous to nmap or Metasploit for network security. They run probe libraries against any accessible model and report per-attack-category failure rates. garak (NVIDIA, Apache 2.0) is the leading open-source tool in this category, covering 20+ probe families including DAN attacks, encoding-based injection, GCG adversarial suffixes, package hallucination, training data replay, and more.

Developer-facing tools also bridge LLM evaluation and red-teaming: promptfoo (MIT, acquired by OpenAI) generates adversarial test suites as part of its eval workflow and integrates with CI/CD pipelines; Giskard Scan (Apache 2.0) generates adversarial suites automatically from a plain-language agent description, covering OWASP LLM Top-10 categories including prompt injection, harmful content, and stereotypes.

Challenge taxonomy in competitive red teaming

Competitive events typically score challenges by difficulty tier. HackAPrompt’s Pliny track uses 12 challenges of increasing difficulty. The CBRNE track focuses on eliciting chemical, biological, radiological, nuclear, and explosives information from production models — a metric directly relevant to AI alignment and responsible deployment policy.

What red-teaming reveals

Relationship to alignment

Red-teaming is an empirical validation layer for alignment claims. A model cannot be said to reliably refuse CBRN synthesis requests if no adversarial evaluation has been conducted. Anthropic’s ASL-3 deployment gate explicitly requires demonstrated red-team resistance before deployment of models with advanced CBRN capabilities.

Resources