LLM jailbreaking

Overview

LLM jailbreaking is the practice of crafting adversarial inputs that bypass an AI model’s safety training and elicit harmful or prohibited outputs. Despite extensive safety fine-tuning, all known production LLMs remain vulnerable to some form of jailbreak. The underlying academic research on adversarial ML dates to at least 2013; practical LLM jailbreaks became widespread with the public deployment of ChatGPT and Claude.

A universal jailbreak is a single prompting strategy that elicits harmful responses across an entire category of forbidden queries — the most dangerous class because it can be weaponized at scale without per-query customization.

Attack taxonomy

Input-side attacks (prompt manipulation)

Many-shot jailbreaking — flooding the context window with many examples of the desired harmful behavior until the model complies via in-context learning
Style manipulation — uSiNg UnUsUaL cApItALiZaTiOn or other typographic distortions that confuse safety classifiers while remaining human-readable
Cipher/encoding attacks — encoding harmful queries in Base64, ROT13, or invented ciphers so safety classifiers never see the plaintext
Role-play / persona injection — instructing the model to adopt a character (e.g. “DAN”, “Developer Mode”) that supposedly has no restrictions
Keyword substitution — replacing flagged terms with innocuous synonyms (e.g. “Soman” → “water” in a synthesis route)
Prompt injection — inserting adversarial instructions into content the model is asked to process (documents, web pages, tool outputs)
Language switching — requesting harmful information in a low-resource language whose safety training is weaker
GCG (Greedy Coordinate Gradient, adversarial suffix) — appending a token sequence, optimised by gradient-guided search, that pushes the model to begin its response affirmatively instead of refusing; introduced by Zou et al. (2023); automatable and often transferable across models; covered by garak’s gcg probe
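
The style-manipulation and cipher/encoding entries above work because a classifier only sees the surface form of the prompt. One possible mitigation, sketched here under assumptions rather than taken from any cited system, is to screen several normalized "views" of the same prompt. The function names `candidate_views` and `_try_base64` and the 16-character minimum run length are hypothetical choices for illustration.

```python
from __future__ import annotations

import base64
import binascii
import codecs
import re

# Runs of base64-alphabet characters long enough to plausibly carry a message.
# The minimum length of 16 is an arbitrary illustrative threshold.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def _try_base64(chunk: str) -> str | None:
    """Return decoded text if chunk is valid base64 for printable UTF-8, else None."""
    try:
        raw = base64.b64decode(chunk, validate=True)
        text = raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return text if text.isprintable() else None


def candidate_views(prompt: str) -> list[str]:
    """Return the prompt plus normalized and decoded variants to screen."""
    views = [
        prompt,
        prompt.casefold(),               # undoes aLtErNaTiNg capitalization
        codecs.decode(prompt, "rot13"),  # ROT13 is its own inverse
    ]
    for match in _B64_RUN.findall(prompt):
        decoded = _try_base64(match)
        if decoded is not None:
            views.append(decoded)
    return views
```

A downstream classifier would then be run on every returned view and the prompt flagged if any view is flagged. Invented ciphers are not covered by this kind of fixed normalization, which is one reason the attack class remains effective.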

Output-side attacks

Continuation attacks — priming the model’s response with the beginning of a harmful completion and asking it to continue
Structured output exploitation — requesting harmful information formatted as code, JSON, or a table where safety training may be weaker

Glitch token attacks

Certain tokens (sometimes called “glitch tokens”) trigger anomalous model behavior, typically because they exist in the tokenizer vocabulary but appeared rarely or never in the training data: the model may repeat the token, output garbage, or ignore instructions. These can be used as a jailbreak vector or to destabilize safety classifiers.

Defenses

Model-level (in-weights)

Safety fine-tuning (RLHF, Constitutional AI, preference learning) teaches the model to refuse harmful requests. Effective but not sufficient: jailbreaks specifically circumvent these in-weights safety behaviors.

Classifier-level (external)

Input and output classifiers screen requests/completions independently of the model’s own safety training. See Constitutional classifiers for Anthropic’s approach of training these classifiers on constitution-derived synthetic data to achieve >95% jailbreak blocking with minimal overrefusal.
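
As a rough illustration of how external classifiers sit around the model rather than inside it, the sketch below wraps a model call with an input screen and an output screen. It is a minimal sketch, not Anthropic’s implementation: `GuardedModel`, `generate`, `input_flagged`, `output_flagged`, and the refusal string are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "Request blocked by safety classifier."  # placeholder message


@dataclass
class GuardedModel:
    """Wraps a model call with independent input and output screens."""

    generate: Callable[[str], str]         # hypothetical model call
    input_flagged: Callable[[str], bool]   # hypothetical input classifier
    output_flagged: Callable[[str], bool]  # hypothetical output classifier

    def respond(self, prompt: str) -> str:
        # Screen the request before the model ever sees it.
        if self.input_flagged(prompt):
            return REFUSAL
        completion = self.generate(prompt)
        # Screen the completion independently of the model's own refusals.
        if self.output_flagged(completion):
            return REFUSAL
        return completion
```

Because the two screens are separate callables, either one can be retrained or replaced without touching the model weights, which is what makes this layer independent of in-weights safety training.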

Operational / monitoring

Rate limiting and anomaly detection on users who repeatedly trigger classifiers
Rapid-response classifier updates when new attacks are discovered in production
Logging and audit trails to identify systematic exploitation attempts
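
The first item above, throttling users who repeatedly trigger classifiers, can be approximated with a per-user sliding window. This is a sketch under stated assumptions: the class name `TriggerMonitor`, the default limit of 5 triggers, and the one-hour window are illustrative values, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class TriggerMonitor:
    """Counts classifier triggers per user inside a sliding time window."""

    def __init__(self, max_triggers: int = 5, window_seconds: float = 3600.0):
        self.max_triggers = max_triggers      # illustrative default
        self.window_seconds = window_seconds  # illustrative default
        self._events = defaultdict(deque)     # user_id -> trigger timestamps

    def record_trigger(self, user_id: str, timestamp: float) -> bool:
        """Log one trigger; return True if the user now exceeds the limit."""
        events = self._events[user_id]
        events.append(timestamp)
        # Drop triggers that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_triggers
```

A True return value could feed rate limiting or manual review; the retained timestamps double as a minimal audit trail for the third item.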

Measurement challenges

Jailbreak success is query-specific: a technique that works on one query may fail on another, making “universal” success hard to define and measure
Automated graders (LLM-as-judge) have non-trivial false-negative rates
Red-team conditions (automated grader feedback loop) are more attacker-favorable than real-world deployment conditions
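
One way to make the query-specificity point concrete is to score each technique by the fraction of distinct forbidden queries it breaks, and to treat "universal" as a threshold on that fraction. The sketch below assumes results arrive as `(technique, query_id, succeeded)` triples from some grader; the function names and the 0.5 default threshold are hypothetical, and grader errors are not corrected for.

```python
from collections import defaultdict


def success_rates(results):
    """Map each technique to the fraction of distinct queries it broke.

    results: iterable of (technique, query_id, succeeded) triples.
    """
    per_technique = defaultdict(dict)
    for technique, query_id, succeeded in results:
        # A query counts as broken if any attempt with this technique succeeded.
        prior = per_technique[technique].get(query_id, False)
        per_technique[technique][query_id] = prior or succeeded
    return {t: sum(q.values()) / len(q) for t, q in per_technique.items()}


def universal_techniques(results, threshold=0.5):
    """Return techniques whose per-query success rate meets the threshold."""
    rates = success_rates(results)
    return sorted(t for t, rate in rates.items() if rate >= threshold)
```

Counting per distinct query rather than per attempt keeps a technique from looking universal just because it was retried many times against one easy query. Since the grader itself has false negatives, the resulting rates are best read as lower bounds.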

Resources

2026-06-24 ◦ garak (GitHub) — automated LLM vulnerability scanner with 20+ probe families covering DAN attacks, encoding injection, GCG adversarial suffixes, glitch tokens, and more; see LLM vulnerability scanning
2025-02-03 ◦ Constitutional Classifiers (Anthropic) — Anthropic’s defense technique achieving 4.4% jailbreak success vs. 86% baseline; includes human red-teaming and public live demo results
2025-02-03 ◦ Constitutional Classifiers paper (arXiv) — full technical paper on classifier training, constitution design, and evaluation methodology
Many-shot jailbreaking (Anthropic) — context-window flooding attack; long prompts with many harmful examples override safety training
2026-06-24 ◦ Pliny HackAPrompt Dataset (HuggingFace) — 16,902 open-sourced jailbreak submissions from competitive red-teaming; covers 12 challenge types including text and image-based attacks; tags: prompt-injections, jailbreaks, redteaming; CC-BY-4.0

Competitive red-teaming and datasets

The jailbreak research community has developed structured competitive events where participants attempt to break production LLMs for prizes. These generate large, diverse, human-authored attack corpora:

HackAPrompt — competitive platform with >30,000 participants across 150+ countries; tracks include CBRNE elicitation, indirect prompt injection (MATS x Trails), and the Pliny track (named after the “Pliny the Liberator” jailbreak community)
Pliny HackAPrompt Dataset — 16,902 submissions from the Pliny track, open-sourced under CC-BY-4.0; one of the largest publicly available human-generated jailbreak corpora

See Red teaming LLMs for the broader methodology and Prompt injection for the indirect injection sub-class.
