LLM evaluation

Overview

LLM evaluation (evals) is the systematic practice of measuring a language model or LLM-powered application against a defined set of test cases and criteria. Unlike deterministic unit tests, evals must handle non-deterministic outputs — the same input can produce multiple valid responses — so evaluation frameworks use a mix of assertion types: string/regex matching, semantic similarity, LLM-as-judge scoring, and human review. Evals serve both as quality gates during development (catching prompt regressions before deployment) and as benchmarks for choosing between models or configurations.

Evaluation frameworks typically separate concerns into: test-case definition, model execution, assertion/detection, and reporting. They can be run locally, as part of CI/CD pipelines, or as part of LLM vulnerability scanning workflows that use adversarial probes to find safety failures.
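The four concerns above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular framework is implemented: `EvalCase`, `run_suite`, and `fake_model` are hypothetical names, and the canned model stands in for a real API call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """Test-case definition: a prompt plus an assertion on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the model, apply its assertion, build a report."""
    rows = []
    for case in cases:
        output = model(case.prompt)            # model execution
        passed = bool(case.check(output))      # assertion/detection
        rows.append({"name": case.name, "output": output, "passed": passed})
    n_passed = sum(row["passed"] for row in rows)
    pass_rate = n_passed / len(rows) if rows else 0.0
    return {"total": len(rows), "passed": n_passed,
            "pass_rate": pass_rate, "rows": rows}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption)."""
    canned = {"What is 2 + 2?": "2 + 2 = 4", "Capital of France?": "Lyon"}
    return canned.get(prompt, "")


cases = [
    EvalCase("arithmetic", "What is 2 + 2?", lambda out: "4" in out),
    EvalCase("geography", "Capital of France?", lambda out: "paris" in out.lower()),
]

report = run_suite(fake_model, cases)

# Reporting: one line per case plus a summary.
for row in report["rows"]:
    status = "PASS" if row["passed"] else "FAIL"
    print(f"{status} {row['name']}: {row['output']!r}")
print(f"{report['passed']}/{report['total']} passed")
```

Keeping the model behind a plain callable means the same suite can be pointed at a different model or prompt version without touching the test cases.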

Core concepts

Assertion types

Exact/regex match — output must contain or equal a string pattern; fast and cheap but brittle for generative outputs
Semantic similarity — embedding-based cosine similarity against a reference answer; tolerates paraphrase but requires an embedding model
LLM-as-judge — a second LLM grades the output against criteria (correctness, groundedness, conformity, style); flexible but introduces model bias
Custom functions — arbitrary code assertions for domain-specific logic
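As a rough illustration of how these assertion types differ, the sketch below implements one of each kind except LLM-as-judge. It makes a loud assumption: `embed` is a toy bag-of-words counter standing in for a real embedding model, so the similarity score only captures word overlap, not meaning. All function names are hypothetical.

```python
import json
import math
import re
from collections import Counter


def contains(output: str, needle: str) -> bool:
    """String match: case-insensitive substring check."""
    return needle.lower() in output.lower()


def matches(output: str, pattern: str) -> bool:
    """Regex match: the pattern must occur somewhere in the output."""
    return re.search(pattern, output) is not None


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar(output: str, reference: str, threshold: float = 0.7) -> bool:
    """Semantic-similarity assertion: score must reach the threshold."""
    return cosine(embed(output), embed(reference)) >= threshold


def has_json_keys(output: str, required: set[str]) -> bool:
    """Custom function: output must be a JSON object with the given keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


reference = "The capital of France is Paris"
paraphrase = "Paris is the capital of France"

print(paraphrase == reference)            # exact match rejects the paraphrase
print(similar(paraphrase, reference))     # similarity accepts it
print(matches("Order #12345 shipped", r"#\d{5}"))
print(has_json_keys('{"city": "Paris"}', {"city"}))
```

The paraphrase case shows the trade-off named in the list: exact matching rejects a correct answer that is worded differently, while a similarity threshold accepts it at the cost of choosing and tuning that threshold.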

LLM-as-judge

Using a capable LLM (often GPT-4-class) to score or classify another model’s outputs. Common judge tasks: factual correctness, answer groundedness in retrieved context (for RAG), refusal appropriateness, style conformity. The Giskard library’s built-in judges include `Groundedness`, `Conformity`, and `LLMJudge`. Promptfoo supports LLM-as-judge via its `llm-rubric` assertion type.

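Key caveat: judge models have their own biases and may favor outputs that stylistically resemble their own (self-preference bias), so judge verdicts are worth spot-checking against a sample of human-labeled examples.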
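The general shape of an LLM-as-judge check can be sketched as follows. This is not the Giskard or promptfoo implementation: the prompt template, the `grade` function, and the JSON reply format are assumptions made for illustration, and `stub_judge` is a hypothetical stand-in for a call to a real judge model.

```python
import json
from typing import Callable

# Hypothetical rubric prompt; real frameworks use their own templates.
JUDGE_TEMPLATE = """You are grading a model answer.
Criteria: {criteria}
Question: {question}
Answer: {answer}
Reply with JSON only: {{"pass": true or false, "reason": "<one sentence>"}}"""


def grade(judge: Callable[[str], str], question: str, answer: str,
          criteria: str) -> dict:
    """Ask the judge for a verdict and parse it defensively."""
    prompt = JUDGE_TEMPLATE.format(
        criteria=criteria, question=question, answer=answer
    )
    raw = judge(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # A judge that ignores the format counts as a failed check.
        return {"pass": False, "reason": "unparseable judge reply"}
    if not isinstance(verdict, dict):
        return {"pass": False, "reason": "unparseable judge reply"}
    return {
        "pass": verdict.get("pass") is True,
        "reason": str(verdict.get("reason", "")),
    }


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge model call (assumption)."""
    ok = "Answer: Paris" in prompt
    reason = "matches the expected city" if ok else "wrong city"
    return json.dumps({"pass": ok, "reason": reason})


good = grade(stub_judge, "Capital of France?", "Paris", "factually correct")
bad = grade(stub_judge, "Capital of France?", "Lyon", "factually correct")
print(good)
print(bad)
```

Treating an unparseable reply as a failure rather than a pass keeps a misbehaving judge from silently inflating scores.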

Regression testing

Running the same eval suite before and after a model update, prompt change, or fine-tuning run to detect behavioral regressions. Frameworks like promptfoo integrate with CI/CD (GitHub Actions, GitLab CI) to fail builds when pass rates drop below thresholds.
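A threshold gate of this kind can be sketched in plain Python, independent of any framework. The function names, the result format (case name mapped to pass/fail), and the 0.9 default threshold are all assumptions for illustration.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline: dict[str, bool],
                     candidate: dict[str, bool]) -> list[str]:
    """Cases that passed before the change but fail (or are missing) after."""
    return sorted(
        name for name, ok in baseline.items()
        if ok and not candidate.get(name, False)
    )


def gate(baseline: dict[str, bool], candidate: dict[str, bool],
         min_pass_rate: float = 0.9) -> int:
    """Return an exit code: 0 to let the build pass, 1 to fail it."""
    regressions = find_regressions(baseline, candidate)
    rate = pass_rate(candidate)
    if regressions:
        print(f"regressed cases: {', '.join(regressions)}")
    if rate < min_pass_rate:
        print(f"pass rate {rate:.2f} is below threshold {min_pass_rate:.2f}")
    return 1 if regressions or rate < min_pass_rate else 0


# Results from the same suite before and after a prompt change.
before = {"greeting": True, "refund_policy": True, "edge_case": False}
after = {"greeting": True, "refund_policy": False, "edge_case": True}

exit_code = gate(before, after)
print("exit code:", exit_code)
# In a CI script this value would be passed to sys.exit().
```

Checking per-case regressions as well as the aggregate rate matters because, as in the example, one newly fixed case can mask one newly broken case if only the total is compared.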

Multi-turn / agentic evaluation

Testing full conversations or multi-step agent pipelines, not just single input-output pairs. Giskard v3’s Scenario API supports defining multi-turn interactions with `.interact()` chains. This is essential for evaluating agents that maintain state across turns or take external actions.
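To show why single-turn checks are not enough for stateful systems, here is a framework-agnostic sketch of a multi-turn scenario runner. It deliberately does not use Giskard's Scenario API: `MemoryAgent`, `run_scenario`, and the turn format are hypothetical, and the toy agent stands in for a real LLM-backed one.

```python
from typing import Callable, Optional

Turn = tuple[str, Callable[[str], bool]]  # (user message, check on reply)


class MemoryAgent:
    """Toy stateful agent (assumption): remembers a name across turns."""

    def __init__(self) -> None:
        self.name: Optional[str] = None

    def reply(self, message: str) -> str:
        text = message.lower()
        if text.startswith("my name is "):
            self.name = message[len("my name is "):].strip(" .")
            return f"Nice to meet you, {self.name}."
        if "what is my name" in text:
            if self.name:
                return f"Your name is {self.name}."
            return "I don't know your name."
        return "OK."


def run_scenario(agent: MemoryAgent, turns: list[Turn]) -> dict:
    """Send turns in order to one agent instance; stop at the first failure."""
    transcript = []
    for index, (message, check) in enumerate(turns):
        response = agent.reply(message)
        transcript.append((message, response))
        if not check(response):
            return {"passed": False, "failed_turn": index,
                    "transcript": transcript}
    return {"passed": True, "failed_turn": None, "transcript": transcript}


scenario = [
    ("My name is Ada.", lambda r: "Ada" in r),
    ("Tell me a fact.", lambda r: len(r) > 0),
    ("What is my name?", lambda r: "Ada" in r),  # depends on turn 0 state
]

result = run_scenario(MemoryAgent(), scenario)
print(result["passed"], result["failed_turn"])
```

The final check only passes because the same agent instance saw the first turn; running that turn in isolation against a fresh agent fails, which is the behavior a single input-output test cannot cover.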

Tooling

promptfoo

Open-source CLI and library (Node.js/TypeScript, also `pip install promptfoo`). Developer-first; evals run locally, so prompts go only to the model providers you configure rather than through a promptfoo-hosted service. Acquired by OpenAI in 2025, remains MIT licensed.

Key features:

YAML-based test-case definition
Side-by-side model comparison (OpenAI, Anthropic, Azure, Bedrock, Ollama, etc.)
Built-in assertion types plus LLM-as-judge
CI/CD integration with GitHub Actions
Red-teaming / LLM vulnerability scanning mode that generates adversarial test suites
Pull-request code scanning for LLM security issues
Production use: according to the project, it is used by LLM apps serving 10M+ users

Giskard (giskard-checks)

Open-source Python library (Apache 2.0) for testing agentic systems. The `giskard-checks` package provides the Scenario API for defining eval suites; `giskard-scan` adds automated red teaming against the categories in the OWASP Top 10 for LLM Applications.