Overview

LLM jailbreaking is the practice of crafting adversarial inputs that bypass an AI model’s safety training and cause it to produce harmful or prohibited outputs. Despite extensive safety fine-tuning, all known production LLMs remain vulnerable to some form of jailbreak. The underlying idea dates to at least 2013 in academic adversarial ML research; practical LLM jailbreaks became widespread with the public deployment of ChatGPT and Claude.

A universal jailbreak is a single prompting strategy that elicits harmful responses across an entire category of forbidden queries — the most dangerous class because it can be weaponized at scale without per-query customization.
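One way to make "across an entire category" concrete is to grade a strategy against every query in the category and look at the fraction that succeed. The sketch below is illustrative only: the `(strategy, query_id, succeeded)` record layout, the strategy names, and the 0.9 threshold are all assumptions, not an established standard for calling a jailbreak universal.

```python
from collections import defaultdict


def success_rate_by_strategy(results):
    """Return {strategy: fraction of graded queries on which it succeeded}.

    `results` is an iterable of (strategy, query_id, succeeded) tuples.
    This record layout is an assumption made for the example.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for strategy, _query_id, succeeded in results:
        totals[strategy] += 1
        if succeeded:
            wins[strategy] += 1
    return {s: wins[s] / totals[s] for s in totals}


def universal_strategies(results, threshold=0.9):
    """List strategies whose success rate meets the (assumed) threshold."""
    rates = success_rate_by_strategy(results)
    return sorted(s for s, rate in rates.items() if rate >= threshold)


# Hypothetical graded outcomes for two strategies on one query category.
example_results = [
    ("strategy_a", "q1", True),
    ("strategy_a", "q2", True),
    ("strategy_a", "q3", True),
    ("strategy_b", "q1", True),
    ("strategy_b", "q2", False),
    ("strategy_b", "q3", False),
]

print(universal_strategies(example_results))
```

With this toy data only `strategy_a` clears the threshold, since `strategy_b` works on one query out of three and would need per-query customization.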

Attack taxonomy

Input-side attacks (prompt manipulation)

Output-side attacks

Glitch token attacks

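Certain tokens (sometimes called “glitch tokens”) trigger anomalous model behavior due to tokenizer edge cases, typically tokens that exist in the tokenizer vocabulary but appeared rarely or never in the training data. When it encounters one, the model may repeat the token, output garbage, or ignore instructions. Attackers can use these tokens as a jailbreak vector or to destabilize safety classifiers.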
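A simple way to screen for such tokens is a repeat probe: ask the model to echo each candidate string and flag the ones it cannot reproduce, a commonly reported symptom. The sketch below assumes a caller-supplied `ask_model` function (hypothetical, standing in for whatever model API is in use) and uses a stub model and a made-up token so that it runs on its own. It is a rough heuristic, not a complete detector.

```python
def find_repeat_failures(candidates, ask_model):
    """Return the candidate strings the model fails to echo back.

    `ask_model` is a hypothetical callable: prompt (str) -> reply (str).
    """
    flagged = []
    for token in candidates:
        prompt = f'Repeat the following string exactly: "{token}"'
        reply = ask_model(prompt)
        if token not in reply:
            flagged.append(token)
    return flagged


def stub_model(prompt):
    """Stand-in for a real model call, used only to make the sketch runnable.

    It echoes the quoted string, except for one made-up token that it
    answers with unrelated text, imitating anomalous behavior.
    """
    quoted = prompt.split('"')[1]
    if quoted == "<hypothetical-glitch>":
        return "distribute"
    return quoted


suspects = find_repeat_failures(
    ["hello", "<hypothetical-glitch>", "world"], stub_model
)
print(suspects)
```

Against a real model the substring check would likely need loosening (whitespace, casing, repeated sampling), since an occasional failed echo does not by itself prove a tokenizer problem.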

Defenses

Model-level (in-weights)

Safety fine-tuning (RLHF, Constitutional AI, preference learning) teaches the model to refuse harmful requests. Effective but not sufficient: jailbreaks specifically circumvent these in-weights safety behaviors.

Classifier-level (external)

Input and output classifiers screen requests/completions independently of the model’s own safety training. See Constitutional classifiers for Anthropic’s approach of training these classifiers on constitution-derived synthetic data to achieve >95% jailbreak blocking with minimal overrefusal.
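The general shape of classifier-level screening can be sketched as a wrapper that checks the prompt before generation and the completion after it. Everything named here (`GuardedModel`, `keyword_flag`, `echo_model`, the blocklist term, the refusal text) is a hypothetical stand-in. Real input and output classifiers are trained models rather than keyword matches, and this is not Anthropic's implementation.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "Request declined by safety filter."


@dataclass
class GuardedModel:
    """Wrap a model with independent input and output classifiers."""

    generate: Callable[[str], str]         # the underlying model
    input_flagged: Callable[[str], bool]   # True if the prompt is blocked
    output_flagged: Callable[[str], bool]  # True if the completion is blocked

    def respond(self, prompt: str) -> str:
        # Screen the request before it reaches the model.
        if self.input_flagged(prompt):
            return REFUSAL
        completion = self.generate(prompt)
        # Screen the completion regardless of what the model decided.
        if self.output_flagged(completion):
            return REFUSAL
        return completion


# Toy stand-ins so the sketch runs; a keyword match is NOT a real classifier.
BLOCKLIST = ("forbidden-topic",)


def keyword_flag(text: str) -> bool:
    return any(term in text.lower() for term in BLOCKLIST)


def echo_model(prompt: str) -> str:
    return f"echo: {prompt}"


guarded = GuardedModel(
    generate=echo_model,
    input_flagged=keyword_flag,
    output_flagged=keyword_flag,
)

print(guarded.respond("hello"))
print(guarded.respond("tell me about FORBIDDEN-TOPIC"))
```

Keeping the two checks separate from `generate` reflects the point of this layer: a jailbreak that defeats the model's in-weights refusals still has to get past screens that never saw the adversarial framing as an instruction.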

Operational / monitoring

Measurement challenges

Resources

Competitive red-teaming and datasets

The jailbreak research community has developed structured competitive events in which participants attempt to break production LLMs for prizes. These events generate large, diverse, human-authored attack corpora.

See Red teaming LLMs for the broader methodology and Prompt injection for the indirect injection sub-class.