Overview

LLM evaluation (evals) is the systematic practice of measuring a language model or LLM-powered application against a defined set of test cases and criteria. Unlike deterministic unit tests, evals must handle non-deterministic outputs — the same input can produce multiple valid responses — so evaluation frameworks use a mix of assertion types: string/regex matching, semantic similarity, LLM-as-judge scoring, and human review. Evals serve both as quality gates during development (catching prompt regressions before deployment) and as benchmarks for choosing between models or configurations.

Evaluation frameworks typically separate concerns into: test-case definition, model execution, assertion/detection, and reporting. They can be run locally, as part of CI/CD pipelines, or as part of LLM vulnerability scanning workflows that use adversarial probes to find safety failures.

Core concepts

Assertion types

LLM-as-judge

Using a capable LLM (often GPT-4-class) to score or classify another model’s outputs. Common judge tasks: factual correctness, answer groundedness in retrieved context (for RAG), refusal appropriateness, style conformity. The Giskard library’s built-in judges include `Groundedness`, `Conformity`, and `LLMJudge`. Promptfoo supports LLM-as-judge via its `llm-rubric` assertion type.

Key caveat: judge models have their own biases and may favor outputs that stylistically resemble their own outputs.

Regression testing

Running the same eval suite before and after a model update, prompt change, or fine-tuning run to detect behavioral regressions. Frameworks like promptfoo integrate with CI/CD (GitHub Actions, GitLab CI) to fail builds when pass rates drop below thresholds.

Multi-turn / agentic evaluation

Testing full conversations or multi-step agent pipelines, not just single input-output pairs. Giskard v3’s Scenario API supports defining multi-turn interactions with `.interact()` chains. This is essential for evaluating agents that maintain state across turns or take external actions.

Tooling

promptfoo

Open-source CLI + library (Node.js/TypeScript, also `pip install promptfoo`). Developer-first, runs 100% locally (prompts never leave the machine). Acquired by OpenAI in 2025, remains MIT licensed.

Key features:

Giskard (giskard-checks)

Open-source Python library (Apache 2.0) for testing agentic systems. The `giskard-checks` package provides the Scenario API for defining eval suites; `giskard-scan` adds automated red teaming against OWASP LLM Top-10 categories.

Key features:

Relationship to red-teaming

Eval frameworks and red teaming lie on a spectrum. Standard evals test expected behaviors (does the model answer correctly?); red-team evals test failure modes (can an adversary make the model misbehave?). Tools like promptfoo and Giskard straddle both: they offer standard eval suites for quality and adversarial suites for security. LLM vulnerability scanning tools like garak focus exclusively on the adversarial side.

Resources