CC11: Evaluating Skills, Subagents & Prompts

Learning Objectives

Recognize the four-stage eval workflow: dataset, runner, grader, report.
Build a 20-case test dataset that covers typical, ambiguous, and adversarial inputs.
Decide between code-based and model-based grading; combine them when needed.
Write a model-based grader (LLM-as-judge) that produces stable, schema-validated scores.
Wire evals into a pre-commit hook and CI to block regressions.

CC2 showed prompt rewrites can lift accuracy 50%. Without an eval, you can't tell if your next rewrite helps or hurts. Evals turn "hope" into a number.

Why Eval, Why Now

Everyday Analogy

Imagine refactoring a 10K-line codebase with no tests. Every change is a roll of the dice. You move fast for a week, then a flaky bug surfaces in production, you don't know which commit caused it, and you spend three days bisecting.

The pain without a test suite: changes that work in your head break in subtle, distant ways. You start being scared to refactor.

The same holds for prompts. A skill that classifies commits well today might silently regress when you "improve" it tomorrow. Without a test suite for the prompt, the only feedback is angry users. Evals are the test suite.

Technical Definition

A prompt evaluation (or eval) is an automated test that runs a prompt against a set of inputs with known good outputs and produces a numerical score. The dataset is your unit-test fixtures; the grader is your assertion. A score of "84/100 cases correct" turns prompt engineering from vibes into science.

For Claude Code specifically

Skills, subagents, slash commands, and HTTP hooks are all prompts. As your team grows, multiple people edit them. Without evals, you'll get into "the commit-classifier subagent used to be more accurate" arguments with no way to settle them. Evals make prompts auditable.

The Eval Workflow — Four Stages

From hunch to confidence in four stages

1. DATASET

Test cases

Inputs + expected outputs (or rubric).

2. RUNNER

Execute

Loop dataset, run prompt on each input.

3. GRADER

Score

Compare actual vs expected. Code or LLM.

4. REPORT

Aggregate

Pass rate, regressions, per-case detail.

Each stage is a separate concern. Datasets evolve slowly. Runners are reusable. Graders are the hard part. Reports answer "did my last change help?"
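
The four stages can be sketched end to end in a few lines. This is a minimal illustration, not the course's runner: `fake_model` is a hypothetical stand-in for the prompt under test, and the two inline cases are only there to make the sketch runnable without an API key.

```python
from statistics import mean

# 1. DATASET: inputs with expected outputs (two inline cases for illustration).
dataset = [
    {"id": "t01", "input": "Add OAuth login endpoint", "expected": "feat"},
    {"id": "t02", "input": "Resolve null deref in pricing.ts", "expected": "fix"},
]

def fake_model(text: str) -> str:
    """Hypothetical stand-in for the prompt under test; a real runner calls the model."""
    return "fix" if "resolve" in text.lower() else "feat"

# 2. RUNNER: loop over the dataset and collect the actual outputs.
def run(cases, model):
    return [{**c, "actual": model(c["input"])} for c in cases]

# 3. GRADER: compare actual vs expected (exact match here).
def grade(row) -> float:
    return 1.0 if row["actual"] == row["expected"] else 0.0

# 4. REPORT: aggregate into a pass rate plus per-case detail.
def report(rows):
    scored = [{**r, "score": grade(r)} for r in rows]
    return {"pass_rate": mean(r["score"] for r in scored), "rows": scored}

result = report(run(dataset, fake_model))
print(f"pass rate: {result['pass_rate']:.2f}")
```

Keeping each stage behind its own function means the grader or the dataset can be swapped without touching the rest.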

Building a Test Dataset

Aim for 20–100 cases. Smaller is fine to start; larger if you can afford the model calls. Cover three slices:

Typical (60% of cases): the everyday inputs you expect.
Ambiguous (20%): inputs where reasonable humans disagree on the answer; the dataset documents the disambiguation choice in a note.
Adversarial (20%): inputs that have historically broken the prompt or push known weaknesses.
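
One way to keep the mix honest as the dataset grows is a small check on tag shares. The sketch below assumes each case carries a `tag` field as in the JSONL examples; the function names and the 10-point tolerance are illustrative choices, not part of the course material.

```python
from collections import Counter

def tag_shares(cases: list[dict]) -> dict[str, float]:
    """Return the fraction of cases carrying each tag."""
    counts = Counter(c["tag"] for c in cases)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def check_mix(cases, targets=None, tolerance=0.10):
    """List the tags whose share is further than `tolerance` from its target."""
    targets = targets or {"typical": 0.6, "ambiguous": 0.2, "adversarial": 0.2}
    shares = tag_shares(cases)
    return [
        f"{tag}: {shares.get(tag, 0.0):.0%} (target {want:.0%})"
        for tag, want in targets.items()
        if abs(shares.get(tag, 0.0) - want) > tolerance
    ]

# Usage: a 6/2/2 split matches the targets, so no warnings are reported.
balanced = ([{"tag": "typical"}] * 6 + [{"tag": "ambiguous"}] * 2
            + [{"tag": "adversarial"}] * 2)
print(check_mix(balanced))
```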

Storage format — JSONL

{"id": "t01", "input": "Add OAuth login endpoint",            "expected": "feat",     "tag": "typical"}
{"id": "t02", "input": "Bump axios from 1.6 to 1.7",           "expected": "chore",    "tag": "typical"}
{"id": "t03", "input": "Resolve null deref in pricing.ts",     "expected": "fix",      "tag": "typical"}
{"id": "a01", "input": "Refactor to add new pricing endpoint", "expected": "refactor", "tag": "ambiguous", "note": "Has both refactor and feat aspects; we treat structure-first as refactor."}
{"id": "x01", "input": "Improve",                              "expected": "chore",    "tag": "adversarial", "note": "Unhelpful message; default to chore."}

How to seed the dataset fast

Mine real usage logs. Pull 50 commits from your repo's history, hand-label them.
Have Claude generate candidates. Ask Claude for 30 plausible commit messages across categories; you label.
Capture failure cases as you find them. Every time the skill misfires in real use, add the case to the dataset before fixing.
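
Capturing a failure is easier when adding a case is a one-liner. The helper below is a sketch under stated assumptions: `add_case` is a hypothetical name, the field names follow the JSONL format above, and the demo case content is invented for illustration.

```python
import json
import tempfile
from pathlib import Path

def add_case(path, case_id, text, expected, tag, note=""):
    """Append one labeled case to a JSONL dataset, refusing duplicate ids."""
    p = Path(path)
    raw = p.read_text() if p.exists() else ""
    existing_ids = {json.loads(line)["id"] for line in raw.splitlines() if line.strip()}
    if case_id in existing_ids:
        raise ValueError(f"duplicate case id: {case_id}")
    case = {"id": case_id, "input": text, "expected": expected, "tag": tag}
    if note:
        case["note"] = note
    with p.open("a") as f:
        if raw and not raw.endswith("\n"):
            f.write("\n")  # keep one JSON object per line
        f.write(json.dumps(case) + "\n")
    return case

# Usage against a throwaway file (hypothetical case content).
demo_path = Path(tempfile.mkdtemp()) / "commits.jsonl"
add_case(demo_path, "x04", "wip", "chore", "adversarial",
         note="Seen in real use; add before fixing the prompt.")
demo_lines = demo_path.read_text().splitlines()
print(len(demo_lines))
```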

Don't let Claude both write and grade

If you use Claude to generate test cases and Claude to grade them, you're measuring "Claude agrees with itself." That's not informative. Either you label the dataset, or your grader is code-based.

Code-Based Grading

Whenever the right answer is a determinable string, number, or shape, grade it with code. Cheap, deterministic, fast.

Exact match

def grade_exact(actual: str, expected: str) -> float:
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0

Regex / structural

import re, json

def grade_json_shape(actual: str, expected_keys: set) -> float:
    try:
        d = json.loads(actual)
    except Exception:
        return 0.0
    return 1.0 if set(d.keys()) == expected_keys else 0.0

def grade_verdict_format(actual: str) -> float:
    return 1.0 if re.match(r"^VERDICT: (ship|hold)\nREASON: .+$", actual.strip()) else 0.0

Composite scoring

For richer outputs, combine several weighted checks:

def grade_review(actual: str, expected: dict) -> float:
    score = 0.0
    if expected["category"] in actual.lower():           score += 0.4
    if "file:line" in actual or re.search(r"\.\w+:\d+", actual): score += 0.3
    if expected["severity"] in actual.lower():           score += 0.3
    return score

When code-based works

Classification, extraction, JSON shape checks, regex format checks, presence/absence of keywords, length bounds, "did Claude call the right tool?" Anything mechanical. Use it whenever you can — it's free, fast, and fully deterministic.
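
Two of the mechanical checks listed above, length bounds and keyword presence/absence, might look like the following. The function names and parameters are illustrative; they return 0.0 or 1.0 so they compose with the other graders in this section.

```python
def grade_length(actual: str, min_chars: int, max_chars: int) -> float:
    """Pass when the stripped output length falls inside the inclusive bounds."""
    return 1.0 if min_chars <= len(actual.strip()) <= max_chars else 0.0

def grade_keywords(actual: str, required=(), forbidden=()) -> float:
    """Pass when every required keyword is present and no forbidden one is."""
    text = actual.lower()
    has_required = all(k.lower() in text for k in required)
    has_forbidden = any(k.lower() in text for k in forbidden)
    return 1.0 if has_required and not has_forbidden else 0.0

# Usage: a review must mention severity and must not be a bare approval.
print(grade_keywords("Severity: high in auth.ts:42",
                     required=["severity"], forbidden=["lgtm"]))
```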

Model-Based Grading (LLM-as-Judge)

For subjective outputs — "is this a good explanation?", "is this code review thorough?" — code-based grading isn't enough. Use a Claude call as the judge, with a strict rubric.

Technical Definition

Model-based grading (LLM-as-judge) uses a separate Claude call to score a candidate output against a rubric and reference. The grader prompt locks the output as a forced tool call (CC8) producing {score: 0..5, reasoning: ...} — never as free text, since you'll aggregate the score.

A canonical judge

JUDGE_TOOL = [{
    "name": "score_output",
    "description": "Score a candidate output against a reference and rubric.",
    "input_schema": {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "minimum": 0, "maximum": 5,
                      "description": "0=way off, 5=matches reference perfectly"},
            "reasoning": {"type": "string", "description": "1-2 sentence justification"},
        },
        "required": ["score", "reasoning"],
    },
}]

# Assumes `client = Anthropic()` has been created, as in the runner later in this module.
def llm_judge(case_input, candidate, reference, rubric):
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0,
        tools=JUDGE_TOOL,
        tool_choice={"type": "tool", "name": "score_output"},
        system=("You are a strict, terse grader. Apply the rubric exactly. "
                "Do NOT be generous. A score of 5 means 'identical to the reference "
                "for all practical purposes.' Most outputs are 2-3."),
        messages=[{"role": "user", "content": (
            f"<rubric>{rubric}</rubric>\n"
            f"<input>{case_input}</input>\n"
            f"<candidate>{candidate}</candidate>\n"
            f"<reference>{reference}</reference>"
        )}],
    )
    judgement = next(b.input for b in resp.content if b.type == "tool_use")
    return judgement["score"] / 5.0, judgement["reasoning"]

Five rules for stable judges

Use temperature 0. Judges that flip scores between runs are useless.
Use a judge at least as capable as the candidate when possible: a Sonnet output graded by Opus is fine; the reverse risks the judge being outclassed.
Force the score via tool use — never parse free text.
Anchor with a reference. "Compare to reference" beats "is this good?"
Tell the judge to be strict. Default Claude is generous; spell out the calibration ("most outputs should score 2-3").
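
Even with a forced tool call, a defensive check on the judge's payload before aggregating may be worthwhile. The sketch below assumes the `{score, reasoning}` shape from the judge tool schema; `validate_judgement` is a hypothetical helper, not part of any SDK.

```python
def validate_judgement(judgement: dict) -> tuple[float, str]:
    """Check a judge payload and return (normalized score in 0..1, reasoning)."""
    score = judgement.get("score")
    reasoning = judgement.get("reasoning")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not 0 <= score <= 5:
        raise ValueError(f"score out of range 0..5: {score}")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")
    return score / 5.0, reasoning.strip()

# Usage with a hand-written payload.
print(validate_judgement({"score": 4, "reasoning": "Matches reference; minor wording gap."}))
```

Rejecting a malformed payload loudly is usually safer than letting it skew the mean.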

Validate your judge against humans

Before trusting model-based scores, hand-grade 20 cases yourself. Compare your scores to the judge's. If correlation is below ~0.8, fix the rubric, not the candidate prompt — a bad judge poisons all downstream conclusions.

A Complete Eval Runner

Putting it together: a runner of about 40 lines that you can adapt.

import json, statistics, time
from anthropic import Anthropic
from pathlib import Path

client = Anthropic()

def run_candidate(prompt_template: str, case: dict) -> str:
    """Run the prompt under test against one case input."""
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=128, temperature=0,
        system=prompt_template,
        messages=[{"role": "user", "content": case["input"]}],
    )
    return resp.content[0].text.strip()

def evaluate(prompt_template: str, dataset_path: str, grader) -> dict:
    cases = [json.loads(l) for l in Path(dataset_path).read_text().splitlines() if l.strip()]
    rows, t0 = [], time.time()
    for c in cases:
        actual = run_candidate(prompt_template, c)
        score = grader(actual, c["expected"])
        rows.append({"id": c["id"], "tag": c.get("tag",""),
                     "expected": c["expected"], "actual": actual, "score": score})
    elapsed = time.time() - t0
    overall = statistics.mean(r["score"] for r in rows)
    by_tag = {}
    for tag in {r["tag"] for r in rows}:
        s = [r["score"] for r in rows if r["tag"] == tag]
        by_tag[tag] = statistics.mean(s) if s else None
    return {"n": len(rows), "score": overall, "by_tag": by_tag,
            "elapsed_s": round(elapsed, 1), "rows": rows}

if __name__ == "__main__":
    PROMPT = Path("commit_classifier.system.txt").read_text()
    grader = lambda a, e: 1.0 if a.lower().strip() == e.lower() else 0.0
    result = evaluate(PROMPT, "commits.jsonl", grader)
    print(f"Overall: {result['score']:.2f} ({result['n']} cases, {result['elapsed_s']}s)")
    for tag, s in result["by_tag"].items():
        print(f"  {tag:<12} {s:.2f}")
    # Persist for diffing across runs
    Path("eval_runs").mkdir(exist_ok=True)
    Path(f"eval_runs/{int(time.time())}.json").write_text(json.dumps(result, indent=2))

The output that matters

$ python eval.py
Overall: 0.86 (50 cases, 41.2s)
  typical      0.93
  ambiguous    0.78
  adversarial  0.65

The breakdown by tag is the most useful number. Improvements should hold across all three tags — if "typical" jumps but "adversarial" drops, you've overfit.
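
Comparing two runs tag by tag makes this kind of overfitting visible. The sketch assumes result dicts shaped like the runner's return value (a `by_tag` mapping of tag to mean score); the function names and the sample numbers for the "after" run are illustrative.

```python
def tag_deltas(before: dict, after: dict) -> dict:
    """Per-tag score change between two eval results (after minus before)."""
    tags = set(before["by_tag"]) | set(after["by_tag"])
    return {
        t: round(after["by_tag"].get(t, 0.0) - before["by_tag"].get(t, 0.0), 4)
        for t in sorted(tags)
    }

def regressed_tags(before: dict, after: dict, tolerance: float = 0.0) -> list:
    """Tags whose score dropped by more than `tolerance`."""
    return [t for t, d in tag_deltas(before, after).items() if d < -tolerance]

# Usage: "typical" improves while "adversarial" drops.
before = {"by_tag": {"typical": 0.93, "ambiguous": 0.78, "adversarial": 0.65}}
after = {"by_tag": {"typical": 0.97, "ambiguous": 0.78, "adversarial": 0.55}}
print(tag_deltas(before, after))
print(regressed_tags(before, after))
```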

Regression Gates & CI

An eval runner is only useful if it gates merges. Wire it in:

Pre-commit hook (local)

#!/usr/bin/env bash
# .git/hooks/pre-commit (must be executable: chmod +x .git/hooks/pre-commit)
set -e

# Only run if a prompt or subagent file changed
if git diff --cached --name-only | grep -qE '\.(system|prompt)\.txt$|^agents/|^\.claude/skills/'; then
  python evals/run.py 0.85 || {
    echo "Eval below threshold. Use 'git commit --no-verify' to override."
    exit 1
  }
fi

GitHub Actions (PR gate)

name: prompt-evals
on:
  pull_request:
    paths: ["prompts/**", "agents/**", ".claude/skills/**"]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install anthropic
      - name: Run evals (fail if overall < 0.85)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python evals/run.py 0.85

What "fail" should mean

Hard threshold: overall score must be ≥ X (e.g. 0.85). Useful when you first add the gate and have no baseline yet.
Regression delta: overall score must not drop more than 5% vs main. Better long-term — allows quality to ratchet up.
Per-tag floor: no tag may drop below 0.6. Catches "improve typicals at cost of adversarials."

Cost of running evals

50 cases at 200 tokens each on Sonnet costs roughly $0.04 per run, so running it in CI on every PR is cheap. If your dataset is 1000 cases or you use an Opus judge, costs add up; sample randomly per PR and run the full set on main.
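
The cost arithmetic and the per-PR sampling can both be sketched briefly. The price is a placeholder argument, not a quoted rate: the 4.0 USD per million tokens used below is a hypothetical blended figure chosen only so the arithmetic reproduces the order of magnitude in the text. Seeding the sample (for example with the PR number) keeps re-runs of the same PR comparable.

```python
import random

def estimate_cost(n_cases: int, tokens_per_case: int,
                  usd_per_million_tokens: float) -> float:
    """Rough cost of one eval run; check current pricing before relying on it."""
    return n_cases * tokens_per_case * usd_per_million_tokens / 1_000_000

def sample_cases(cases: list, k: int, seed: int) -> list:
    """Deterministic random subset of the dataset for a fast per-PR run."""
    rng = random.Random(seed)
    return rng.sample(cases, min(k, len(cases)))

# Usage: 50 cases x 200 tokens at a hypothetical 4.0 USD per million tokens.
print(f"${estimate_cost(50, 200, 4.0):.2f} per run")

# Usage: pick 100 of 1000 cases, seeded by a hypothetical PR number.
all_cases = [{"id": f"c{i:04d}"} for i in range(1000)]
subset = sample_cases(all_cases, k=100, seed=1234)
print(len(subset))
```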

Hands-On Lab — Add an Eval Gate to a Subagent

You'll take the commit-classifier-v1 subagent from CC2 and add a 20-case eval that runs in CI, blocking any change that pushes accuracy below a 0.85 threshold.

Step 1 — Create the dataset

mkdir -p evals
cat > evals/commits.jsonl <<'EOF'
{"id":"t01","input":"Add OAuth login endpoint","expected":"feat","tag":"typical"}
{"id":"t02","input":"Resolve null deref in pricing.ts","expected":"fix","tag":"typical"}
{"id":"t03","input":"Bump axios 1.6 -> 1.7","expected":"chore","tag":"typical"}
{"id":"t04","input":"Update README install steps","expected":"docs","tag":"typical"}
{"id":"t05","input":"Extract auth middleware to module","expected":"refactor","tag":"typical"}
{"id":"a01","input":"Refactor to add new pricing endpoint","expected":"refactor","tag":"ambiguous","note":"structure-first wins"}
{"id":"a02","input":"Fix typo and add caching layer","expected":"feat","tag":"ambiguous","note":"feat dominates"}
{"id":"x01","input":"Improve","expected":"chore","tag":"adversarial"}
{"id":"x02","input":"asdf","expected":"chore","tag":"adversarial"}
{"id":"x03","input":"feat: add docs","expected":"docs","tag":"adversarial","note":"prefix lies, body wins"}
EOF
# (add 10 more — aim for 20 total)

Step 2 — The runner

# evals/run.py
import json, sys, statistics
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()

PROMPT = Path("agents/commit-classifier-v1.md").read_text().split("---", 2)[-1]
DATA = [json.loads(l) for l in Path("evals/commits.jsonl").read_text().splitlines() if l.strip()]

def classify(msg: str) -> str:
    r = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=8, temperature=0,
        system=PROMPT, messages=[{"role":"user","content":msg}],
    )
    return r.content[0].text.strip().lower()

scores = []
for c in DATA:
    actual = classify(c["input"])
    ok = actual == c["expected"]
    scores.append({"id": c["id"], "tag": c["tag"], "ok": ok})
    print(f"{c['id']:<5} {c['tag']:<12} {'OK' if ok else 'FAIL'} got={actual} expected={c['expected']}")

overall = sum(s["ok"] for s in scores) / len(scores)
threshold = float(sys.argv[1]) if len(sys.argv) > 1 else 0.85
print(f"\nOverall: {overall:.2f}")
sys.exit(0 if overall >= threshold else 1)

Step 3 — Run it

$ python evals/run.py 0.85
t01   typical      OK got=feat expected=feat
t02   typical      OK got=fix expected=fix
...
Overall: 0.95
$ echo $?
0

Step 4 — Break the prompt to verify the gate

Edit the subagent and remove the <examples> block. Re-run.

$ python evals/run.py 0.85
...
Overall: 0.65
$ echo $?
1

The gate fires: removing examples broke 6 cases, score dropped to 0.65, exit code 1, CI fails. Restore the block; back to green.

Step 5 — Wire into GitHub Actions

Save the prompt-evals workflow from "GitHub Actions (PR gate)" above as a YAML file under .github/workflows/ and add ANTHROPIC_API_KEY as a repository secret. The workflow runs the same python evals/run.py 0.85 command on every pull request that touches the listed paths.


Module Summary

Four-stage workflow: dataset, runner, grader, report.
Datasets: 20–100 cases, mix of typical / ambiguous / adversarial. Tag every case.
Code-based grading first — exact match, regex, JSON shape, presence checks. Cheap and deterministic.
Model-based grading for subjective outputs. Use temp 0, force score via tool use, anchor with reference, calibrate strictness.
Validate your judge against humans before trusting it — correlate at ≥ 0.8 or fix the rubric.
Gate with a hard threshold, a regression delta, and a per-tag floor. Sample on PRs, full set on main.
Wire evals into pre-commit and GitHub Actions. Cost is typically < $0.10 / run for small datasets — cheap insurance.