Overview
LLM vulnerability scanning is the practice of systematically probing a language model with crafted inputs to identify failure modes, safety gaps, and exploitable weaknesses before or during deployment. Analogous to network vulnerability scanners like nmap or penetration-testing frameworks like Metasploit, an LLM vulnerability scanner automates the process of sending adversarial probes, collecting model outputs, and detecting whether the model exhibited an undesirable behavior. The field combines static probe libraries (known attack patterns), dynamic generation (adaptive prompts that react to model responses), and detector modules that classify outputs as safe or unsafe.
Unlike human red teaming, automated scanning trades creativity for scale: it can run thousands of probe variants overnight against any model accessible via API or locally, producing quantitative failure rates per attack category.
Architecture of a vulnerability scanner
A general-purpose LLM vulnerability scanner (exemplified by garak) is structured around five plugin categories:
- Probes — classes that generate adversarial inputs targeting a specific vulnerability class (jailbreaks, prompt injection, hallucination elicitation, toxicity, data leakage, etc.)
- Generators — adapters that connect the scanner to a target model (OpenAI API, Hugging Face Hub, AWS Bedrock, local gguf models, any REST endpoint)
- Detectors — classifiers that evaluate model outputs and decide whether a probe attempt produced a “hit” (a failure)
- Harnesses — orchestration logic that pairs probes with appropriate detectors and runs the scan
- Evaluators — reporting and scoring components that aggregate results into per-probe and per-attack-category metrics
The default mode runs all known probes against a target; specific probe families or individual probes can be selected for targeted assessments.
Probe taxonomy
Common vulnerability categories covered by automated LLM scanners:
Safety and content policy
- DAN and persona attacks — “Do Anything Now” and similar role-play jailbreaks that instruct the model to adopt an unrestricted persona
- Continuation probes — completing a partially-provided harmful phrase
- Misleading claims — testing if the model will support false or dangerous claims
- Do-not-answer probes — requests responsible models should always refuse
Prompt injection
- Encoding-based injection — hiding instructions in Base64, MIME, quoted-printable, or other encodings that safety classifiers may not decode
- PromptInject — structured indirect injection (NeurIPS ML Safety Workshop 2022 best paper)
- XSS probes — cross-site-scripting-style data exfiltration attempts
- GCG (adversarial suffix) — appending optimised token sequences that disrupt system prompt adherence
Code and data safety
- Package hallucination — eliciting code that imports non-existent packages, which attackers can register as malicious supply-chain packages; see
- Malware generation — prompting the model to write functional malicious code
- Training data replay — probing whether the model will regurgitate memorised private or copyrighted training data
Adversarial robustness
- Bad characters — Unicode perturbations (invisible characters, homoglyphs, bidirectional overrides) that confuse safety classifiers
- Glitch tokens — inputs that trigger anomalous model behavior due to tokenizer edge cases
- Snowballed hallucination — leading the model into confident wrong answers on questions too complex for it to verify
Relationship to human red-teaming
Automated scanning and human red teaming are complementary:
- Scanners provide coverage and repeatability; humans provide creativity and novelty. Novel jailbreak strategies often start human-discovered and are later encoded as scanner probes.
- Scanners are appropriate for regression testing (did a model update break an existing defense?) and continuous integration pipelines.
- Human red-teaming is better for discovering qualitatively new attack surfaces and for evaluating model behavior in open-ended real-world contexts.
- LLM jailbreaking research generates the raw attack knowledge that scanner probe libraries encode.
Tooling
garak
NVIDIA’s open-source LLM vulnerability scanner (github.com/NVIDIA/garak). Key characteristics:
- Command-line tool; install via
pip install garak - Supports OpenAI, Hugging Face (local + API), AWS Bedrock, Replicate, Cohere, Groq, NIM, gguf/llama.cpp, any REST endpoint
- 20+ probe families covering jailbreaks, prompt injection, hallucination, toxicity, data leakage, malware, and more
- Plugin architecture: probes, detectors, generators, harnesses, evaluators all independently extensible
- Output: per-probe FAIL rates, detailed JSONL run logs, hit logs of successful exploits
- Paper: arXiv:2406.11036 (Derczynski et al., 2024)
promptfoo
Open-source CLI + library (Node.js; also pip install promptfoo) that bridges
LLM evaluation and red-teaming. Its red-team mode auto-generates adversarial test
suites and produces security vulnerability reports; the same tool also handles
standard LLM evaluation (model comparison, regression testing). Runs 100% locally;
MIT licensed; acquired by OpenAI in 2025. Integrates with CI/CD (GitHub Actions)
to fail builds on security regression.
Giskard Scan
Python package (pip install giskard-scan) from Giskard’s v3 modular
architecture. Generates adversarial test suites from a plain-language agent
description, covering OWASP LLM Top-10 threat categories: prompt injection,
harmful content, stereotypes, misinformation, data leakage, and more. Supports
custom ScenarioGenerator instances for extending probe coverage. Apache 2.0.
Works alongside Giskard Checks (see LLM evaluation) in the same testing pipeline.
Resources
- 2026-06-24 ◦ garak (GitHub) — NVIDIA’s open-source LLM vulnerability scanner; Generative AI Red-teaming & Assessment Kit; 20+ probe families; plugin architecture; Apache 2.0
- 2026-06-24 ◦ garak paper (arXiv:2406.11036) — “garak: A Framework for Security Probing Large Language Models” (Derczynski et al., 2024)
- garak documentation — user guide and reference docs
- 2026-06-24 ◦ promptfoo (GitHub) — CLI/library for LLM evals and red-teaming; generates adversarial vulnerability reports; CI/CD integration; MIT licensed; now part of OpenAI
- 2026-06-24 ◦ promptfoo red-teaming docs — guide to promptfoo’s automated red-teaming and vulnerability scanning workflow
- 2026-06-24 ◦ Giskard OSS (GitHub) — Python library for agentic system testing; giskard-scan covers OWASP LLM Top-10 categories; async-first; Apache 2.0