LLM vulnerability scanning

Overview

LLM vulnerability scanning is the practice of systematically probing a language model with crafted inputs to identify failure modes, safety gaps, and exploitable weaknesses before or during deployment. Analogous to network vulnerability scanners like nmap or penetration-testing frameworks like Metasploit, an LLM vulnerability scanner automates the process of sending adversarial probes, collecting model outputs, and detecting whether the model exhibited an undesirable behavior. The field combines static probe libraries (known attack patterns), dynamic generation (adaptive prompts that react to model responses), and detector modules that classify outputs as safe or unsafe.

Unlike human red teaming, automated scanning trades creativity for scale: it can run thousands of probe variants overnight against any model accessible via API or locally, producing quantitative failure rates per attack category.

Architecture of a vulnerability scanner

A general-purpose LLM vulnerability scanner (exemplified by garak) is structured around five plugin categories:

Probes — classes that generate adversarial inputs targeting a specific vulnerability class (jailbreaks, prompt injection, hallucination elicitation, toxicity, data leakage, etc.)
Generators — adapters that connect the scanner to a target model (OpenAI API, Hugging Face Hub, AWS Bedrock, local gguf models, any REST endpoint)
Detectors — classifiers that evaluate model outputs and decide whether a probe attempt produced a “hit” (a failure)
Harnesses — orchestration logic that pairs probes with appropriate detectors and runs the scan
Evaluators — reporting and scoring components that aggregate results into per-probe and per-attack-category metrics

The default mode runs all known probes against a target; specific probe families or individual probes can be selected for targeted assessments.

Probe taxonomy

Common vulnerability categories covered by automated LLM scanners:

Safety and content policy

DAN and persona attacks — “Do Anything Now” and similar role-play jailbreaks that instruct the model to adopt an unrestricted persona
Continuation probes — completing a partially-provided harmful phrase
Misleading claims — testing if the model will support false or dangerous claims
Do-not-answer probes — requests responsible models should always refuse

Prompt injection

Encoding-based injection — hiding instructions in Base64, MIME, quoted-printable, or other encodings that safety classifiers may not decode
PromptInject — structured indirect injection (NeurIPS ML Safety Workshop 2022 best paper)
XSS probes — cross-site-scripting-style data exfiltration attempts
GCG (adversarial suffix) — appending optimised token sequences that disrupt system prompt adherence

Code and data safety

Package hallucination — eliciting code that imports non-existent packages, which attackers can register as malicious supply-chain packages; see
Malware generation — prompting the model to write functional malicious code
Training data replay — probing whether the model will regurgitate memorised private or copyrighted training data

Adversarial robustness

Bad characters — Unicode perturbations (invisible characters, homoglyphs, bidirectional overrides) that confuse safety classifiers
Glitch tokens — inputs that trigger anomalous model behavior due to tokenizer edge cases
Snowballed hallucination — leading the model into confident wrong answers on questions too complex for it to verify

Relationship to human red-teaming

Automated scanning and human red teaming are complementary:

Scanners provide coverage and repeatability; humans provide creativity and novelty. Novel jailbreak strategies often start human-discovered and are later encoded as scanner probes.
Scanners are appropriate for regression testing (did a model update break an existing defense?) and continuous integration pipelines.
Human red-teaming is better for discovering qualitatively new attack surfaces and for evaluating model behavior in open-ended real-world contexts.
LLM jailbreaking research generates the raw attack knowledge that scanner probe libraries encode.

Tooling

garak

NVIDIA’s open-source LLM vulnerability scanner (github.com/NVIDIA/garak). Key characteristics:

Command-line tool; install via pip install garak
Supports OpenAI, Hugging Face (local + API), AWS Bedrock, Replicate, Cohere, Groq, NIM, gguf/llama.cpp, any REST endpoint
20+ probe families covering jailbreaks, prompt injection, hallucination, toxicity, data leakage, malware, and more
Plugin architecture: probes, detectors, generators, harnesses, evaluators all independently extensible
Output: per-probe FAIL rates, detailed JSONL run logs, hit logs of successful exploits
Paper: arXiv:2406.11036 (Derczynski et al., 2024)

promptfoo

Open-source CLI + library (Node.js; also pip install promptfoo) that bridges LLM evaluation and red-teaming. Its red-team mode auto-generates adversarial test suites and produces security vulnerability reports; the same tool also handles standard LLM evaluation (model comparison, regression testing). Runs 100% locally; MIT licensed; acquired by OpenAI in 2025. Integrates with CI/CD (GitHub Actions) to fail builds on security regression.

Giskard Scan

Python package (pip install giskard-scan) from Giskard’s v3 modular architecture. Generates adversarial test suites from a plain-language agent description, covering OWASP LLM Top-10 threat categories: prompt injection, harmful content, stereotypes, misinformation, data leakage, and more. Supports custom ScenarioGenerator instances for extending probe coverage. Apache 2.0. Works alongside Giskard Checks (see LLM evaluation) in the same testing pipeline.

Resources

2026-06-24 ◦ garak (GitHub) — NVIDIA’s open-source LLM vulnerability scanner; Generative AI Red-teaming & Assessment Kit; 20+ probe families; plugin architecture; Apache 2.0
2026-06-24 ◦ garak paper (arXiv:2406.11036) — “garak: A Framework for Security Probing Large Language Models” (Derczynski et al., 2024)
garak documentation — user guide and reference docs
2026-06-24 ◦ promptfoo (GitHub) — CLI/library for LLM evals and red-teaming; generates adversarial vulnerability reports; CI/CD integration; MIT licensed; now part of OpenAI
2026-06-24 ◦ promptfoo red-teaming docs — guide to promptfoo’s automated red-teaming and vulnerability scanning workflow
2026-06-24 ◦ Giskard OSS (GitHub) — Python library for agentic system testing; giskard-scan covers OWASP LLM Top-10 categories; async-first; Apache 2.0