Overview

Prompt engineering patterns are reusable structural techniques for instructing language models — particularly cheap or small models (2–13B parameters) — to produce reliable, auditable output. Unlike raw prompting, patterns encode architectural decisions: how to divide cognitive labour across stages, how to prevent confirmation bias, how to make reasoning verifiable, and how to integrate deterministic tools into a probabilistic pipeline.

The domain is especially active in “cheap-model + Unix-tool” pipelines, where the model cannot hold large context windows or reason reliably from memory alone. In these settings, prompting patterns are the primary lever for quality control.

Multi-stage context injection

Split the pipeline into stages with distinct cognitive demands rather than passing everything to a single prompt:

Injecting Stage 1 output into Stage 2’s system prompt (not conversation history) keeps the cheap model’s effective context focused on the relevant signal. See also Planner-Generator-Evaluator pattern for a parallel decomposition in software development workflows.

Falsification-first prompting

Instruct the model to disprove its hypothesis rather than confirm it. Models are completion engines with inherent confirmation bias — they hallucinate coherent answers even when evidence is absent.

Effective prompt instruction:

Your goal is to DISPROVE the finding. Search for counter-evidence first. A defense you cannot find does not exist.

This is especially critical for interpreting empty tool results. When grep returns NO MATCHES, cheap models often hallucinate that a defense exists. Explicit instruction is required: “NO MATCHES means the pattern does NOT exist. Do not assume a defense is present.”

Three possible meanings of an empty grep result — the model must eliminate (B) and (C) before accepting (A):

Crux extraction

Force the model to emit the single key fact its verdict depends on:

1
2
3
4
5
{
  "reasoning": "...",
  "crux": "the memcpy at line 42 copies user-controlled length with no bounds check",
  "verdict": "VALID"
}

The crux field enables crux stability tracking: if the crux changes significantly between rounds (e.g. edit distance >0.3), the model’s reasoning is unstable — flag for human review. The orchestrator can also detect a stuck loop (same crux, oscillating verdicts) and terminate early or escalate. The crux also functions as a join key for cross-file memory.

Confidence-based model escalation

After a cheap model reaches its maximum rounds:

1
2
3
4
5
6
confidence = valid_rounds / total_rounds

confidence > 0.7  → VALID   (accept cheap model verdict)
confidence < 0.3  → INVALID (accept cheap model verdict)
0.3–0.7           → escalate: 1 round with a stronger model
still ambiguous   → INCONCLUSIVE

Early escalation signals (before max rounds):

This pattern keeps ~80% of cases resolved by cheap models; the remaining ~20% (genuinely ambiguous) get stronger reasoning. Related: for how tool budgets influence confidence calibration.

Structured context sections

Delimit different types of context with explicit section tags. Cheap models lose track of what kind of information they are reading without explicit demarcation.

1
2
3
4
5
[SYSTEM INSTRUCTIONS]
[FINDING UNDER REVIEW]
[PRIOR ROUND EVIDENCE]   ← condensed grep output from previous rounds
[CROSS-FILE CONTEXT]     ← related findings from other files
[USER QUERY / CURRENT TASK]

Without section tags, prior-round evidence bleeds into current reasoning unpredictably. Section tags also allow the orchestrator to selectively replace stale sections (e.g. condensing prior evidence) without restructuring the whole prompt.

Layered output parsing

Cheap models produce malformed structured output predictably. A robust parsing strategy proceeds in layers:

  1. Extract tool requests first (regex scan for GREP:, FIND:, etc.) — before attempting JSON parsing. Tool calls must not be missed even if surrounding JSON is malformed.
  2. Strip markdown fences.
  3. Remove trailing commas, normalize line breaks.
  4. Try json.loads().
  5. On failure: regex-extract individual JSON objects.
  6. Final fallback: INCONCLUSIVE with raw output preserved for human review.

Retry policy: send the model a correction message only for structural (syntactic) errors — missing braces, unclosed strings. Never retry semantic errors (wrong field names, wrong values) — cheap models rarely self-correct these and waste tokens. Max 1 retry.

Observability metrics

The most diagnostic metrics for a cheap-model pipeline:

Log as structured JSON per round: {finding_id, round, model, prompt_tokens, tool_calls[], crux_before, crux_after, confidence, latency_ms}.

Resources