Red teaming LLMs

Overview

Red teaming LLMs is the practice of systematically attempting to elicit harmful, unsafe, or policy-violating outputs from a language model in order to discover vulnerabilities before or during deployment. Borrowed from military and cybersecurity traditions, LLM red-teaming ranges from informal adversarial probing by individual researchers to structured competitive events with formal scoring and monetary prizes.

Unlike automated benchmarks, human red-teamers can discover qualitatively novel attack strategies that no benchmark anticipated. This makes red-teaming a critical complement to automated safety evaluations in alignment research.

Forms of red teaming

Internal (closed) red teaming

AI labs employ dedicated red teams who attempt to break the model before release. Anthropic’s prototype Constitutional Classifiers evaluation used 183 participants over >3,000 hours with up to $15,000 bounty for a confirmed universal jailbreak. No universal jailbreak was found in that round.

Public demo / bug bounty

A live model is exposed to the public for a fixed window. Anthropic’s Constitutional Classifiers public demo (Feb 3–10, 2025) attracted 339 participants who collectively spent ~3,700 hours on 8 graded chemical-weapons jailbreak challenges. The system held for five days; on days six and seven, 4 participants cleared all 8 levels (1 confirmed universal jailbreak). See Constitutional classifiers for full evaluation detail.

Competitive platforms

Open competitions where participants submit jailbreak attempts against a target model for prizes and public recognition. The LLM jailbreaking research community has developed a rich competitive ecosystem:

HackAPrompt (hackaprompt.com) — claims to be the world’s largest AI hacking platform; has distributed >$100,000 in prizes across multiple competitions
Pliny X HackAPrompt — competition track featuring 12 jailbreak challenges (9 text + 3 image-only) named after “Pliny the Liberator,” a prominent jailbreak community persona; submissions open-sourced as the Pliny HackAPrompt Dataset (16,902 rows, CC-BY-4.0)

Indirect / agentic red teaming

Testing prompt injection attacks in agentic contexts — where an LLM takes actions on behalf of users. Attackers attempt to hijack the agent’s instructions via content it processes (emails, documents, web pages). See Prompt injection for the underlying vulnerability class.

Automated vulnerability scanning

LLM vulnerability scanning tools automate adversarial probing at scale — analogous to nmap or Metasploit for network security. They run probe libraries against any accessible model and report per-attack-category failure rates. garak (NVIDIA, Apache 2.0) is the leading open-source tool in this category, covering 20+ probe families including DAN attacks, encoding-based injection, GCG adversarial suffixes, package hallucination, training data replay, and more.

Developer-facing tools also bridge LLM evaluation and red-teaming: promptfoo (MIT, acquired by OpenAI) generates adversarial test suites as part of its eval workflow and integrates with CI/CD pipelines; Giskard Scan (Apache 2.0) generates adversarial suites automatically from a plain-language agent description, covering OWASP LLM Top-10 categories including prompt injection, harmful content, and stereotypes.

Challenge taxonomy in competitive red teaming

Competitive events typically score challenges by difficulty tier. HackAPrompt’s Pliny track uses 12 challenges of increasing difficulty. The CBRNE track focuses on eliciting chemical, biological, radiological, nuclear, and explosives information from production models — a metric directly relevant to AI alignment and responsible deployment policy.

What red-teaming reveals

Which attack categories (cipher, role-play, many-shot, prompt injection) are most effective against current defenses
How quickly skilled adversaries find novel strategies vs. how long defenses hold
Attack rate as a function of time and number of participants — useful for estimating real-world exploit risk in deployment
Qualitative insights into model failure modes that automated evaluations miss

Relationship to alignment

Red-teaming is an empirical validation layer for alignment claims. A model cannot be said to reliably refuse CBRN synthesis requests if no adversarial evaluation has been conducted. Anthropic’s ASL-3 deployment gate explicitly requires demonstrated red-team resistance before deployment of models with advanced CBRN capabilities.

Resources

2026-06-24 ◦ garak (GitHub) — NVIDIA’s open-source LLM vulnerability scanner; Generative AI Red-teaming & Assessment Kit; 20+ probe families; plugin architecture; Apache 2.0
2026-06-24 ◦ Pliny HackAPrompt Dataset (HuggingFace) — 16,902 open-sourced jailbreak submissions from the Pliny X HackAPrompt competitive red-teaming event; CC-BY-4.0; tags: redteaming, safety, prompt-injections, jailbreaks
2025-02-03 ◦ Constitutional Classifiers (Anthropic) — formal internal and public red-team results; 183 internal + 339 public participants, jailbreak rate reduced from 86% to 4.4%
HackAPrompt — competitive AI red-teaming platform; runs CBRNE, Pliny, Tutorial, and agentic (MATS x Trails) tracks
2026-06-24 ◦ promptfoo (GitHub) — open-source CLI and library for LLM evals and automated red-teaming; generates adversarial test suites; CI/CD integration; MIT licensed; acquired by OpenAI
2026-06-24 ◦ Giskard Scan (GitHub) — Python vulnerability scanner for agentic systems; auto-generates adversarial suites from agent descriptions; covers OWASP LLM Top-10; Apache 2.0