Prompt injection

Overview

Prompt injection is a class of security vulnerability in which an attacker embeds adversarial instructions inside content that an LLM is asked to process, causing the model to follow the attacker’s instructions rather than the legitimate user’s or system’s. The name is analogous to SQL injection: just as SQL injection smuggles database commands inside user-supplied data, prompt injection smuggles model instructions inside user-supplied content.

Prompt injection is distinct from direct jailbreaking (where the attacker interacts with the model directly) in that it is often indirect — the attacker does not have direct access to the model’s prompt but can influence it by controlling content the model ingests (documents, emails, web pages, tool outputs).

Attack variants

Direct prompt injection

The attacker directly instructs the model by including adversarial text in their own input, e.g. “Ignore previous instructions and do X.” This is the simplest form and is heavily defended against in production models via system prompts and safety training.

Indirect prompt injection

The attacker plants instructions in content the model will later process on behalf of a legitimate user. Examples:

A malicious webpage that tells a browsing agent to exfiltrate session cookies
A PDF document that instructs a summarization agent to append false information
An email that redirects an email-assistant agent to forward messages to an attacker’s address

Indirect prompt injection is significantly harder to defend against because the malicious content is often indistinguishable from legitimate content at the point of ingestion.

Stored prompt injection (persistent)

Malicious instructions stored in a database or external system that the model will query. Once stored, the attack affects every future model interaction that retrieves the contaminated record.

Encoding-based injection

Hiding injected instructions in text encodings (Base64, MIME, quoted-printable, ROT13, Unicode homoglyphs) that safety classifiers may not decode before passing to the model. garak’s encoding probe family systematically tests for this; it found that more recent ChatGPT variants were more susceptible to encoding-based injection than older models. See LLM vulnerability scanning for automated detection.

Why LLMs are vulnerable

LLMs do not have a clear architectural distinction between “instructions” (trusted) and “data” (untrusted). Both arrive as text in the same input stream. Safety training can teach models to ignore obvious override attempts but cannot reliably distinguish a legitimate instruction from a sophisticated injected one, especially when the injected instruction is crafted to look like a legitimate system message.

This vulnerability is structural, not merely a training deficiency. It is analogous to how early web frameworks could not prevent XSS without significant architectural change (output escaping, content security policies).

Why defenses fail: role confusion

Instruction hierarchy systems rely on role tags (<system>, <user>, <tool>) to assign privilege levels. However, models do not authenticate roles by their structural tags — they perceive roles by the style and content of the text. A message written to sound like a system instruction will be treated as one regardless of which tag wraps it. This is the root cause identified as role confusion in LLMs (Ye et al., 2026).

CoT forgery

CoT forgery (Ye et al., 2026) is a zero-shot variant that exploits the chain-of-thought role. The attack injects fabricated <think> / reasoning text into user messages or tool outputs. Because the model perceives this forged text as its own prior reasoning, it adopts the injected conclusions without scrutiny — even when the forged reasoning is transparently absurd. This yields ~60% attack success rate against frontier models on StrongREJECT, and 56–70% ASR in agent hijacking scenarios where standard injections achieved near-zero.

Agent hijacking

In agentic settings, indirect prompt injection becomes agent hijacking: malicious instructions embedded in web pages, emails, or tool results cause the agent to execute attacker-controlled actions (exfiltrating secrets, sending emails, modifying files). The threat surface scales with agent autonomy — a more capable agent with more tools represents higher hijacking risk.

Defenses and mitigations

Instruction hierarchy / privilege separation — model is trained to weight system-prompt instructions above user-prompt instructions, and user-prompt instructions above retrieved-content instructions
Input sanitization — scan retrieved content for injection patterns before feeding to the model (analogous to HTML escaping in web security)
Output monitoring — classify model outputs to detect signs of successful injection (e.g. unexpected actions, data exfiltration patterns)
Minimal tool permissions — limit what actions an agent can take so that a successful injection causes less damage (principle of least privilege)
Human-in-the-loop — require human approval for high-consequence actions, limiting the damage a successful injection can cause autonomously
Role probes — mechanistic interpretability tools (see role confusion in LLMs) that measure internal role perception; can detect role confusion before generation
Dual-LLM patterns — route tool outputs through a separate, restricted model that cannot execute instructions

No defense completely eliminates the risk; defense-in-depth is the recommended approach.

Relationship to LLM jailbreaking

Prompt injection is often listed as a jailbreak technique because injected instructions can be used to bypass safety training. However, the threat model differs: jailbreaking typically involves a user trying to get the model to help them; prompt injection typically involves a third-party attacker trying to hijack the model’s behavior in an agentic pipeline without the user’s knowledge.

Relationship to agentic AI

Prompt injection is the primary security concern for AI agents that operate with tool access and external data retrieval. As agents gain more autonomous capabilities (browsing, email, code execution, API calls), successful injection attacks can cause real-world harm far beyond generating harmful text.

Competitive red-teaming data

The HackAPrompt MATS x Trails competition track specifically tests indirect prompt injection in agentic settings. The broader Pliny HackAPrompt Dataset (16,902 submissions) includes prompt injection as one of its tagged attack categories and serves as an open benchmark for injection technique diversity. See Red teaming LLMs for the competitive red-teaming context.

Resources

2026-06-24 ◦ garak (GitHub) — automated scanner with dedicated encoding and promptinject probe families; the encoding probe revealed newer ChatGPT models are more susceptible to encoding-based injection than older variants
2026-06-24 ◦ Pliny HackAPrompt Dataset (HuggingFace) — open-sourced competitive jailbreak/prompt-injection submissions; 16,902 rows; includes direct and indirect injection techniques
HackAPrompt — competitive platform with dedicated indirect prompt injection track (MATS x Trails)
OWASP LLM Top 10 — prompt injection is ranked #1 in OWASP’s top security risks for LLM applications
2026-07-02 ◦ Prompt Injection as Role Confusion (Ye et al.) — traces prompt injection to role confusion: models perceive roles by text style not structural tags; introduces CoT Forgery (~60% ASR) and role probes as mechanistic interpretability tools