Overview
LLM guardrails are programmable controls placed around a language model to restrict, shape, or validate its inputs and outputs at runtime. Unlike fine-tuning or alignment training (which alter model weights), guardrails are external mechanisms applied at inference time — they intercept the flow of data before it reaches the model, after it leaves, or at multiple points in between. Guardrails are the primary tool for making LLM-based applications production-safe: preventing jailbreaks, blocking prompt injection, enforcing topic scope, and ensuring outputs meet policy requirements.
NVIDIA’s NeMo Guardrails (open-source, Python) is the most prominent toolkit for this. It models guardrails as “rails” that fire at five distinct pipeline stages, defined in , the domain-specific language bundled with the toolkit.
Five rail types
NeMo Guardrails defines five pipeline positions where a rail can intercept:
| Rail type | Position | Example uses |
|---|---|---|
| Input rails | Applied to user message before the LLM sees it | Reject jailbreak attempts; rewrite ambiguous queries |
| Dialog rails | Control LLM prompting on canonical-form messages | Enforce predefined dialog paths; constrain topic scope |
| Retrieval rails | Applied to RAG chunks before they are used | Reject irrelevant or unsafe retrieved content |
| Execution rails | Applied to tool/action inputs and outputs | Validate API call parameters; filter tool results |
| Output rails | Applied to LLM response before it reaches the user | Moderate unsafe outputs; enforce language style |
Input rails
Operate before any LLM call. Use cases: detecting prompt injection, blocking disallowed topics, normalising user input. Can reject (return error) or transform (rewrite the message).
Dialog rails
Operate on a canonical representation of the conversation state — the LLM is prompted or constrained in how it continues. Used for scripted dialog paths: ensuring the bot follows a specific flow (e.g. always collect name before proceeding).
Retrieval rails
Specific to RAG pipelines. The retrieved chunks pass through the rail before being inserted into the LLM context. Use cases: filtering chunks that are off-topic, hallucination-prone, or policy-violating.
Execution rails
Applied to inputs sent to external tools (function calls, API invocations) and to their returned results. Provides a safety boundary around agentic tool use — see AI agents.
Output rails
The final checkpoint before the user sees a response. Use cases: PII detection, toxicity filtering, fact-checking against a source document, enforcing required disclaimers.
What guardrails protect against
- Jailbreaks — attempts to override system prompts or elicit disallowed content through roleplay, hypotheticals, or encoding tricks
- Prompt injection — attacker-controlled content in retrieved documents, tool outputs, or user messages that hijacks the model’s instructions
- Off-topic drift — the model being steered outside its intended domain by user manipulation or emergent conversation
- Unsafe outputs — harmful, toxic, or legally sensitive content generated by the model
NeMo Guardrails
NVIDIA’s open-source toolkit (github.com/NVIDIA-NeMo/Guardrails). Latest: v0.21.0. Python 3.10–3.13.
- Rails are defined in (`.co` files) and Python action handlers
- `LLMRails` Python class wraps any supported LLM; API is compatible with OpenAI Chat Completions
- Integrates with LangChain and supports GPT-3.5/4, LLaMA-2, Falcon, Vicuna, Mosaic
- Primary use cases: RAG applications (retrieval + output moderation), domain-specific chatbots, LLM API endpoints
Guardrails vs fine-tuning
Guardrails are a complement to, not a replacement for, alignment training:
- Fine-tuning shapes the model’s default tendencies — effective for tone and general safety
- Guardrails are deterministic and auditable — a rail either fires or it doesn’t; easier to debug and update
- Guardrails can be updated without retraining the model — critical for fast-changing policy requirements
- Guardrails add latency (extra LLM calls for classification) — fine-tuning does not
Related topics
- — the DSL used to define NeMo Guardrails rails and dialog flows
- AI agents — execution rails are the guardrail mechanism relevant to agentic tool use
- AI — broader landscape of LLM tooling
Resources
- 2026-06-23 ◦ NeMo Guardrails (GitHub) — NVIDIA’s open-source toolkit; five rail types (input, dialog, retrieval, execution, output); Colang DSL; LLMRails Python API; v0.21.0; protects against jailbreaks, prompt injection, off-topic responses, unsafe outputs