LLM guardrails

Overview

LLM guardrails are programmable controls placed around a language model to restrict, shape, or validate its inputs and outputs at runtime. Unlike fine-tuning or alignment training (which alter model weights), guardrails are external mechanisms applied at inference time — they intercept the flow of data before it reaches the model, after it leaves, or at multiple points in between. Guardrails are the primary tool for making LLM-based applications production-safe: preventing jailbreaks, blocking prompt injection, enforcing topic scope, and ensuring outputs meet policy requirements.

NVIDIA’s NeMo Guardrails (open-source, Python) is the most prominent toolkit for this. It models guardrails as “rails” that fire at five distinct pipeline stages, defined in , the domain-specific language bundled with the toolkit.

Five rail types

NeMo Guardrails defines five pipeline positions where a rail can intercept:

Rail type	Position	Example uses
Input rails	Applied to user message before the LLM sees it	Reject jailbreak attempts; rewrite ambiguous queries
Dialog rails	Control LLM prompting on canonical-form messages	Enforce predefined dialog paths; constrain topic scope
Retrieval rails	Applied to RAG chunks before they are used	Reject irrelevant or unsafe retrieved content
Execution rails	Applied to tool/action inputs and outputs	Validate API call parameters; filter tool results
Output rails	Applied to LLM response before it reaches the user	Moderate unsafe outputs; enforce language style

Input rails

Operate before any LLM call. Use cases: detecting prompt injection, blocking disallowed topics, normalising user input. Can reject (return error) or transform (rewrite the message).

Dialog rails

Operate on a canonical representation of the conversation state — the LLM is prompted or constrained in how it continues. Used for scripted dialog paths: ensuring the bot follows a specific flow (e.g. always collect name before proceeding).

Retrieval rails

Specific to RAG pipelines. The retrieved chunks pass through the rail before being inserted into the LLM context. Use cases: filtering chunks that are off-topic, hallucination-prone, or policy-violating.

Execution rails

Applied to inputs sent to external tools (function calls, API invocations) and to their returned results. Provides a safety boundary around agentic tool use — see AI agents.

Output rails

The final checkpoint before the user sees a response. Use cases: PII detection, toxicity filtering, fact-checking against a source document, enforcing required disclaimers.

What guardrails protect against

Jailbreaks — attempts to override system prompts or elicit disallowed content through roleplay, hypotheticals, or encoding tricks
Prompt injection — attacker-controlled content in retrieved documents, tool outputs, or user messages that hijacks the model’s instructions
Off-topic drift — the model being steered outside its intended domain by user manipulation or emergent conversation
Unsafe outputs — harmful, toxic, or legally sensitive content generated by the model

NeMo Guardrails

NVIDIA’s open-source toolkit (github.com/NVIDIA-NeMo/Guardrails). Latest: v0.21.0. Python 3.10–3.13.

Rails are defined in (`.co` files) and Python action handlers
`LLMRails` Python class wraps any supported LLM; API is compatible with OpenAI Chat Completions
Integrates with LangChain and supports GPT-3.5/4, LLaMA-2, Falcon, Vicuna, Mosaic
Primary use cases: RAG applications (retrieval + output moderation), domain-specific chatbots, LLM API endpoints

Guardrails vs fine-tuning

Guardrails are a complement to, not a replacement for, alignment training:

Fine-tuning shapes the model’s default tendencies — effective for tone and general safety
Guardrails are deterministic and auditable — a rail either fires or it doesn’t; easier to debug and update
Guardrails can be updated without retraining the model — critical for fast-changing policy requirements
Guardrails add latency (extra LLM calls for classification) — fine-tuning does not

— the DSL used to define NeMo Guardrails rails and dialog flows
AI agents — execution rails are the guardrail mechanism relevant to agentic tool use
AI — broader landscape of LLM tooling

Resources

2026-06-23 ◦ NeMo Guardrails (GitHub) — NVIDIA’s open-source toolkit; five rail types (input, dialog, retrieval, execution, output); Colang DSL; LLMRails Python API; v0.21.0; protects against jailbreaks, prompt injection, off-topic responses, unsafe outputs