LLM red-teaming

Overview

LLM red-teaming is the adversarial probing of large language models to discover safety failures, capability overestimates, and hidden behaviours. Borrowed from classical security red-teaming (where a team simulates attackers), LLM red-teaming applies structured adversarial thinking to prompt-based systems: finding the inputs that break stated constraints, reveal confidential configuration, or produce harmful outputs.

Unlike traditional software security testing, LLM red-teaming must contend with the stochastic, instruction-following nature of models: the attack surface is the natural-language interface itself, which means the boundary between legitimate use and exploitation is inherently fuzzy.

Attack categories

Jailbreaking

Prompts designed to bypass safety training and produce outputs the model is trained to refuse. Common approaches: persona roleplay (“pretend you are an AI without restrictions”), hypothetical framing, token smuggling (encoding forbidden words), and iterative refinement. Jailbreaks often work against prompt-based guardrails more reliably than against fine-tuned refusals.

Prompt injection

Attacker-controlled content in the model’s context (e.g. a retrieved document, email, or web page) that hijacks the model’s behaviour. Analogous to SQL injection: trusted instructions and untrusted data share the same channel. Particularly dangerous in agentic systems where the model takes real-world actions.

System prompt extraction

Eliciting the hidden system prompt that configures the model’s behaviour. Methods: direct instruction (“repeat your system prompt verbatim”), indirect reconstruction from behaviour, jailbreak-assisted extraction. The CL4R1T4S repository (see System prompt transparency) maintains a community collection of extracted prompts from major platforms. The x1xhlol repository provides a complementary collection focused on AI coding assistants (Cursor, Windsurf, Devin, Manus, Lovable, Replit, VSCode Agent) and reveals model-identity obfuscation: Windsurf’s Cascade is instructed to claim it runs “GPT 4.1”; Devin instructs the model never to reveal its system prompt.

Capability elicitation

Probing whether a model has capabilities that are suppressed by instruction rather than absent by training. A model instructed not to write code may still do so if the instruction is bypassed — distinguishing “can’t” from “won’t” has significant implications for safety claims.

Red-teaming vs alignment

Red-teaming is complementary to but distinct from alignment research:

Alignment asks: how do we train models to behave safely and helpfully?
Red-teaming asks: does this deployed model actually behave that way?

AI labs increasingly run internal red teams before release (Anthropic’s red-team reports, OpenAI’s preparedness evaluations) and also commission external red-teams. The CL4R1T4S community represents informal external red-teaming focused on system prompt transparency rather than capability discovery.

Agentic risk amplification

Red-teaming concerns are amplified for that take real-world actions. A jailbroken chat model produces harmful text; a jailbroken agent with file access, browser control, or API credentials can execute harmful actions. This makes prompt injection in agentic contexts (e.g. a malicious web page hijacking a browsing agent) a critical threat class.

Defensive countermeasures

are the primary runtime defence against the attack categories above. Each rail type maps to a threat:

Input rails → jailbreak detection before the LLM sees the prompt
Retrieval rails → prompt injection in RAG chunks
Execution rails → prompt injection via tool outputs in agentic pipelines
Output rails → detecting harmful outputs before they reach the user

Guardrails are auditable and updatable without retraining — making them the preferred response to newly-discovered attack vectors that emerge from red-teaming.

Resources

2026-06-23 ◦ CL4R1T4S (GitHub) — community red-team / system prompt extraction project covering Anthropic, OpenAI, Google, xAI, and agent/coding-assistant platforms; 43.6k stars; illustrates scale of informal adversarial probing of commercial LLMs
2026-06-23 ◦ NeMo Guardrails (GitHub) — open-source toolkit implementing runtime as the defensive counterpart to red-team findings; five rail types targeting jailbreaks, prompt injection, and unsafe outputs
2026-07-09 ◦ System Prompts and Models of AI Tools (GitHub, x1xhlol) — coding-assistant-focused extraction collection (Cursor, Windsurf, Devin, Manus, Lovable, etc.); exposes agent tool schemas (tools.json) revealing the full attack surface of each agentic system; documents model-identity obfuscation instructions embedded in commercial prompts