CC11: Evaluating Skills, Subagents & Prompts
Treat your skills, subagents, and prompts like code: have a test suite. This module covers building a test dataset, writing code-based and model-based graders, and turning evals into a CI gate so prompt regressions can't ship.
Learning Objectives
- Recognize the four-stage eval workflow: dataset, runner, grader, report.
- Build a 20-case test dataset that covers typical, ambiguous, and adversarial inputs.
- Decide between code-based and model-based grading; combine them when needed.
- Write a model-based grader (LLM-as-judge) that produces stable, schema-validated scores.
- Wire evals into a pre-commit hook and CI to block regressions.
Why Eval, Why Now
Imagine refactoring a 10K-line codebase with no tests. Every change is a roll of the dice. You move fast for a week, then a flaky bug surfaces in production, you don't know which commit caused it, and you spend three days bisecting.
The pain without a test suite: changes that work in your head break in subtle, distant ways. You start being scared to refactor.
Same with prompts. A skill that classifies commits well today might silently regress when you "improve" it tomorrow. Without a test suite for the prompt, the only feedback is angry users. Evals are the test suite.
A prompt evaluation (or eval) is an automated test that runs a prompt against a set of inputs with known good outputs and produces a numerical score. The dataset is your unit-test fixtures; the grader is your assertion. A score of "84/100 cases correct" turns prompt engineering from vibes into science.
Skills, subagents, slash commands, and HTTP hooks are all prompts. As your team grows, multiple people edit them. Without evals, you'll get into "the commit-classifier subagent used to be more accurate" arguments with no way to settle them. Evals make prompts auditable.
The Eval Workflow — Four Stages
Each stage is a separate concern. Datasets evolve slowly. Runners are reusable. Graders are the hard part. Reports answer "did my last change help?"
Building a Test Dataset
Aim for 20–100 cases. Smaller is fine to start; larger if you can afford the model calls. Cover three slices:
- Typical (60% of cases): the everyday inputs you expect.
- Ambiguous (20%): inputs where reasonable humans disagree on the answer; the dataset documents the disambiguation choice.
- Adversarial (20%): inputs that have historically broken the prompt or push known weaknesses.
Storage format — JSONL
{"id": "t01", "input": "Add OAuth login endpoint", "expected": "feat", "tag": "typical"}
{"id": "t02", "input": "Bump axios from 1.6 to 1.7", "expected": "chore", "tag": "typical"}
{"id": "t03", "input": "Resolve null deref in pricing.ts", "expected": "fix", "tag": "typical"}
{"id": "a01", "input": "Refactor to add new pricing endpoint", "expected": "refactor", "tag": "ambiguous", "note": "Has both refactor and feat aspects; we treat structure-first as refactor."}
{"id": "x01", "input": "Improve", "expected": "chore", "tag": "adversarial", "note": "Unhelpful message; default to chore."}
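A dataset in this format is easy to sanity-check before spending any model calls. The following is a minimal sketch, not a fixed API: the names `load_dataset`, `tag_mix`, `REQUIRED_KEYS`, and `ALLOWED_TAGS` are illustrative assumptions, and the required keys are simply the ones used in the sample rows above.

```python
import json
from collections import Counter

# Assumed minimum schema, taken from the sample rows above
REQUIRED_KEYS = {"id", "input", "expected", "tag"}
ALLOWED_TAGS = {"typical", "ambiguous", "adversarial"}

def load_dataset(text: str) -> list[dict]:
    """Parse JSONL text and fail loudly on malformed or duplicate cases."""
    cases, seen = [], set()
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"line {n}: missing keys {sorted(missing)}")
        if case["tag"] not in ALLOWED_TAGS:
            raise ValueError(f"line {n}: unknown tag {case['tag']!r}")
        if case["id"] in seen:
            raise ValueError(f"line {n}: duplicate id {case['id']!r}")
        seen.add(case["id"])
        cases.append(case)
    return cases

def tag_mix(cases: list[dict]) -> dict[str, float]:
    """Fraction of cases per tag, to compare against the 60/20/20 target."""
    counts = Counter(c["tag"] for c in cases)
    return {tag: counts[tag] / len(cases) for tag in counts}
```

Running `tag_mix(load_dataset(Path("commits.jsonl").read_text()))` gives a quick view of whether one slice is crowding out the others.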
How to seed the dataset fast
- Mine real usage logs. Pull 50 commits from your repo's history, hand-label them.
- Have Claude generate candidates. Ask Claude for 30 plausible commit messages across categories; you label.
- Capture failure cases as you find them. Every time the skill misfires in real use, add the case to the dataset before fixing.
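Capturing a failure case can be a single function call. This is a rough sketch under stated assumptions: `append_case` is a hypothetical helper, the default tag of "adversarial" is a guess at where real-world misfires usually belong, and the file layout follows the JSONL format shown earlier.

```python
import json
from pathlib import Path

def append_case(path, case_id, text, expected, tag="adversarial", note=""):
    """Append one hand-labelled case to a JSONL dataset, refusing duplicate ids."""
    p = Path(path)
    existing = p.read_text(encoding="utf-8") if p.exists() else ""
    ids = {json.loads(line)["id"] for line in existing.splitlines() if line.strip()}
    if case_id in ids:
        raise ValueError(f"case id {case_id!r} already exists")
    case = {"id": case_id, "input": text, "expected": expected, "tag": tag}
    if note:
        case["note"] = note
    with p.open("a", encoding="utf-8") as f:
        # Guard against a file whose last line lacks a trailing newline
        if existing and not existing.endswith("\n"):
            f.write("\n")
        f.write(json.dumps(case) + "\n")
    return case
```

Adding the case before fixing the prompt means the next eval run fails first, which confirms the dataset actually reproduces the misfire.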
If you use Claude to generate test cases and Claude to grade them, you're measuring "Claude agrees with itself." That's not informative. Either you label the dataset, or your grader is code-based.
Code-Based Grading
Whenever the right answer is a determinable string, number, or shape, grade it with code. Cheap, deterministic, fast.
Exact match
def grade_exact(actual: str, expected: str) -> float:
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0
Regex / structural
import re, json

def grade_json_shape(actual: str, expected_keys: set) -> float:
    try:
        d = json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    # Valid JSON that is not an object (e.g. a list) has no keys to compare
    if not isinstance(d, dict):
        return 0.0
    return 1.0 if set(d.keys()) == expected_keys else 0.0

def grade_verdict_format(actual: str) -> float:
    return 1.0 if re.match(r"^VERDICT: (ship|hold)\nREASON: .+$", actual.strip()) else 0.0
Composite scoring
For richer outputs, combine several weighted checks:
import re

def grade_review(actual: str, expected: dict) -> float:
    score = 0.0
    if expected["category"] in actual.lower():
        score += 0.4  # names the expected category
    if "file:line" in actual or re.search(r"\.\w+:\d+", actual):
        score += 0.3  # cites a location such as pricing.ts:42
    if expected["severity"] in actual.lower():
        score += 0.3  # names the expected severity
    return score
Use code-based grading for classification, extraction, JSON shape checks, regex format checks, presence/absence of keywords, length bounds, and "did Claude call the right tool?" checks: anything mechanical. Use it whenever you can; it's free, fast, and fully deterministic.
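Three of the mechanical checks listed above (length bounds, keyword presence/absence, and tool selection) can be sketched as follows. The function names are illustrative, and `grade_tool_called` assumes the response content has already been converted to plain dicts with `type` and `name` fields; adapt it to whatever shape your client returns.

```python
def grade_length(actual: str, max_chars: int) -> float:
    """Pass if the output is non-empty and within the length bound."""
    return 1.0 if 0 < len(actual.strip()) <= max_chars else 0.0

def grade_keywords(actual: str, must_have: list[str], must_not_have: list[str]) -> float:
    """Pass if every required keyword appears and no forbidden keyword does."""
    text = actual.lower()
    has_all = all(k.lower() in text for k in must_have)
    has_none = not any(k.lower() in text for k in must_not_have)
    return 1.0 if has_all and has_none else 0.0

def grade_tool_called(content_blocks: list[dict], expected_tool: str) -> float:
    """Pass if any tool_use block names the expected tool.

    Assumes content_blocks is a list of plain dicts (hypothetical shape).
    """
    called = [b.get("name") for b in content_blocks if b.get("type") == "tool_use"]
    return 1.0 if expected_tool in called else 0.0
```

Each grader returns 0.0 or 1.0, so they can be averaged or weighted in the same way as the composite scorer.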
Model-Based Grading (LLM-as-Judge)
For subjective outputs — "is this a good explanation?", "is this code review thorough?" — code-based grading isn't enough. Use a Claude call as the judge, with a strict rubric.
Model-based grading (LLM-as-judge) uses a separate Claude call to score a candidate output against a rubric and reference. The grader prompt locks the output as a forced tool call (CC8) producing {score: 0..5, reasoning: ...} — never as free text, since you'll aggregate the score.
A canonical judge
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_TOOL = [{
    "name": "score_output",
    "description": "Score a candidate output against a reference and rubric.",
    "input_schema": {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "minimum": 0, "maximum": 5,
                      "description": "0=way off, 5=matches reference perfectly"},
            "reasoning": {"type": "string", "description": "1-2 sentence justification"},
        },
        "required": ["score", "reasoning"],
    },
}]

def llm_judge(case_input, candidate, reference, rubric):
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0,
        tools=JUDGE_TOOL,
        # Force the answer through the scoring tool, never free text
        tool_choice={"type": "tool", "name": "score_output"},
        system=("You are a strict, terse grader. Apply the rubric exactly. "
                "Do NOT be generous. A score of 5 means 'identical to the reference "
                "for all practical purposes.' Most outputs are 2-3."),
        messages=[{"role": "user", "content": (
            f"<rubric>{rubric}</rubric>\n"
            f"<input>{case_input}</input>\n"
            f"<candidate>{candidate}</candidate>\n"
            f"<reference>{reference}</reference>"
        )}],
    )
    # The forced tool call carries the structured judgement
    judgement = next(b.input for b in resp.content if b.type == "tool_use")
    # Normalize 0..5 to 0..1 so judge scores aggregate with code-based scores
    return judgement["score"] / 5.0, judgement["reasoning"]
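Since the score feeds an aggregate, it may be worth validating the judgement before dividing by 5. Here is a defensive sketch, assuming the tool input can occasionally deviate from the declared schema; `validate_judgement` is a hypothetical helper, not part of the Anthropic SDK.

```python
def validate_judgement(judgement: dict) -> tuple[float, str]:
    """Check the judge's tool input and return (normalized_score, reasoning)."""
    score = judgement.get("score")
    reasoning = judgement.get("reasoning")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(score, int) or isinstance(score, bool):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not 0 <= score <= 5:
        raise ValueError(f"score out of range 0..5: {score}")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")
    return score / 5.0, reasoning.strip()
```

A failed validation is better recorded as a grader error than silently counted as a low score, since it says something about the judge rather than the candidate.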
Five rules for stable judges
- Use temperature 0. Judges that flip scores between runs are useless.
- Use a judge at least as capable as the candidate when possible: a Sonnet output graded by Opus is fine; the reverse risks the judge being outclassed.
- Force the score via tool use — never parse free text.
- Anchor with a reference. "Compare to reference" beats "is this good?"
- Tell the judge to be strict. Default Claude is generous; spell out the calibration ("most outputs should score 2-3").
Before trusting model-based scores, hand-grade 20 cases yourself. Compare your scores to the judge's. If correlation is below ~0.8, fix the rubric, not the candidate prompt — a bad judge poisons all downstream conclusions.
A Complete Eval Runner
Putting it together — a 50-line runner you can adapt:
import json, statistics, time
from anthropic import Anthropic
from pathlib import Path

client = Anthropic()

def run_candidate(prompt_template: str, case: dict) -> str:
    """Run the prompt under test against one case input."""
    resp = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=128, temperature=0,
        system=prompt_template,
        messages=[{"role": "user", "content": case["input"]}],
    )
    return resp.content[0].text.strip()

def evaluate(prompt_template: str, dataset_path: str, grader) -> dict:
    cases = [json.loads(l) for l in Path(dataset_path).read_text().splitlines() if l.strip()]
    rows, t0 = [], time.time()
    for c in cases:
        actual = run_candidate(prompt_template, c)
        score = grader(actual, c["expected"])
        rows.append({"id": c["id"], "tag": c.get("tag", ""),
                     "expected": c["expected"], "actual": actual, "score": score})
    elapsed = time.time() - t0
    overall = statistics.mean(r["score"] for r in rows)
    by_tag = {}
    for tag in {r["tag"] for r in rows}:
        s = [r["score"] for r in rows if r["tag"] == tag]
        by_tag[tag] = statistics.mean(s) if s else None
    return {"n": len(rows), "score": overall, "by_tag": by_tag,
            "elapsed_s": round(elapsed, 1), "rows": rows}

if __name__ == "__main__":
    PROMPT = Path("commit_classifier.system.txt").read_text()
    grader = lambda a, e: 1.0 if a.lower().strip() == e.lower() else 0.0
    result = evaluate(PROMPT, "commits.jsonl", grader)
    print(f"Overall: {result['score']:.2f} ({result['n']} cases, {result['elapsed_s']}s)")
    for tag, s in result["by_tag"].items():
        print(f"  {tag:<12} {s:.2f}")
    # Persist for diffing across runs
    Path("eval_runs").mkdir(exist_ok=True)
    Path(f"eval_runs/{int(time.time())}.json").write_text(json.dumps(result, indent=2))
The output that matters
$ python eval.py
Overall: 0.86 (50 cases, 41.2s)
typical 0.93
ambiguous 0.78
adversarial 0.65
The breakdown by tag is the most useful number. Improvements should hold across all three tags — if "typical" jumps but "adversarial" drops, you've overfit.
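One way to spot that kind of overfit is to diff the per-tag scores of two persisted runs. The sketch below assumes both runs are dicts with a `by_tag` mapping, as returned by the runner above; `tag_regressions` and its `tolerance` parameter are hypothetical names.

```python
def tag_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Return {tag: (baseline_score, candidate_score)} for every tag that got worse."""
    worse = {}
    for tag, old in baseline["by_tag"].items():
        new = candidate["by_tag"].get(tag)
        # Skip tags missing from either run rather than guessing a score
        if old is None or new is None:
            continue
        if new < old - tolerance:
            worse[tag] = (old, new)
    return worse

before = {"by_tag": {"typical": 0.93, "ambiguous": 0.78, "adversarial": 0.65}}
after = {"by_tag": {"typical": 0.97, "ambiguous": 0.78, "adversarial": 0.50}}
print(tag_regressions(before, after))  # {'adversarial': (0.65, 0.5)}
```

Here "typical" improved while "adversarial" fell, which is exactly the pattern to treat as a warning rather than a win.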
Regression Gates & CI
An eval runner is only useful if it gates merges. Wire it in:
Pre-commit hook (local)
#!/usr/bin/env bash
# .git/hooks/pre-commit (the shebang must be the first line of the file)
set -e
# Only run if a prompt or subagent file changed
if git diff --cached --name-only | grep -qE '\.(system|prompt)\.txt$|^agents/|^\.claude/skills/'; then
  python evals/run.py 0.85 || {
    echo "Eval below threshold. Use 'git commit --no-verify' to override."
    exit 1
  }
fi
GitHub Actions (PR gate)
name: prompt-evals
on:
  pull_request:
    paths: ["prompts/**", "agents/**", ".claude/skills/**"]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install anthropic
      - name: Run evals (fail if overall < 0.85)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python evals/run.py 0.85
What "fail" should mean
- Hard threshold: overall score must be ≥ X (e.g. 0.85). Useful when you first set up the gate.
- Regression delta: overall score must not drop more than 5% vs main. Better long-term, since it allows quality to ratchet up.
- Per-tag floor: no tag may drop below 0.6. Catches "improve typicals at cost of adversarials."
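The three failure criteria can live in one function that returns the reasons a run failed. This is a sketch under stated assumptions: `gate` is a hypothetical helper, run dicts have `score` and `by_tag` keys like the runner's output, and "5%" is interpreted as 0.05 absolute score points (a relative drop would be an equally valid reading).

```python
from typing import Optional

def gate(current: dict, baseline: Optional[dict] = None, *,
         min_overall: float = 0.85, max_drop: float = 0.05,
         tag_floor: float = 0.6) -> list[str]:
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    # 1. Hard threshold on the overall score
    if current["score"] < min_overall:
        failures.append(f"overall {current['score']:.2f} < threshold {min_overall:.2f}")
    # 2. Regression delta against a baseline run (e.g. main), if one is available
    if baseline is not None:
        drop = round(baseline["score"] - current["score"], 6)  # round away float noise
        if drop > max_drop:
            failures.append(f"overall dropped {drop:.2f} vs baseline (max {max_drop:.2f})")
    # 3. Per-tag floor
    for tag, score in sorted(current["by_tag"].items()):
        if score is not None and score < tag_floor:
            failures.append(f"tag {tag!r} at {score:.2f} < floor {tag_floor:.2f}")
    return failures
```

A CI script could print each reason and then call `sys.exit(1 if failures else 0)`, so the log explains which rule blocked the merge.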
50 cases at 200 tokens each on Sonnet ~ $0.04 / run. Running that in CI on every PR is cheap. If your dataset is 1000 cases or you use an Opus judge, costs add up; in that case, sample randomly per PR and run the full set on main.
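Per-PR sampling works best when it is stratified by tag and seeded, so small slices are never dropped and reruns of the same PR grade the same cases. A minimal sketch follows; `sample_cases` is a hypothetical helper, and using the PR number as the seed is an assumption, not a requirement.

```python
import random
from collections import defaultdict

def sample_cases(cases: list[dict], fraction: float, seed: int) -> list[dict]:
    """Pick roughly `fraction` of the cases from each tag, reproducibly."""
    rng = random.Random(seed)  # e.g. seed with the PR number (assumption)
    by_tag = defaultdict(list)
    for case in cases:
        by_tag[case["tag"]].append(case)
    sample = []
    for tag in sorted(by_tag):  # sorted for a stable draw order
        group = by_tag[tag]
        # Keep at least one case per tag so no slice disappears entirely
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample
```

Scores from a sample are noisier than scores from the full set, so a sampled PR gate may need a slightly looser threshold than the full run on main.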
Hands-On Lab — Add an Eval Gate to a Subagent
You'll take the commit-classifier-v1 subagent from CC2 and add a 20-case eval that runs in CI, blocking any change that pushes accuracy below a 0.85 threshold.
Step 1 — Create the dataset
mkdir -p evals
cat > evals/commits.jsonl <<'EOF'
{"id":"t01","input":"Add OAuth login endpoint","expected":"feat","tag":"typical"}
{"id":"t02","input":"Resolve null deref in pricing.ts","expected":"fix","tag":"typical"}
{"id":"t03","input":"Bump axios 1.6 -> 1.7","expected":"chore","tag":"typical"}
{"id":"t04","input":"Update README install steps","expected":"docs","tag":"typical"}
{"id":"t05","input":"Extract auth middleware to module","expected":"refactor","tag":"typical"}
{"id":"a01","input":"Refactor to add new pricing endpoint","expected":"refactor","tag":"ambiguous","note":"structure-first wins"}
{"id":"a02","input":"Fix typo and add caching layer","expected":"feat","tag":"ambiguous","note":"feat dominates"}
{"id":"x01","input":"Improve","expected":"chore","tag":"adversarial"}
{"id":"x02","input":"asdf","expected":"chore","tag":"adversarial"}
{"id":"x03","input":"feat: add docs","expected":"docs","tag":"adversarial","note":"prefix lies, body wins"}
EOF
# (add 10 more — aim for 20 total)
Step 2 — The runner
# evals/run.py
import json, sys, statistics
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()
# Drop the frontmatter; everything after the second "---" is the system prompt
PROMPT = Path("agents/commit-classifier-v1.md").read_text().split("---", 2)[-1]
DATA = [json.loads(l) for l in Path("evals/commits.jsonl").read_text().splitlines() if l.strip()]

def classify(msg: str) -> str:
    r = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=8, temperature=0,
        system=PROMPT, messages=[{"role": "user", "content": msg}],
    )
    return r.content[0].text.strip().lower()

scores = []
for c in DATA:
    actual = classify(c["input"])
    ok = actual == c["expected"]
    scores.append({"id": c["id"], "tag": c["tag"], "ok": ok})
    print(f"{c['id']:<5} {c['tag']:<12} {'OK' if ok else 'FAIL'} got={actual} expected={c['expected']}")

overall = sum(s["ok"] for s in scores) / len(scores)
threshold = float(sys.argv[1]) if len(sys.argv) > 1 else 0.85
print(f"\nOverall: {overall:.2f}")
# Non-zero exit code is what fails the pre-commit hook and the CI job
sys.exit(0 if overall >= threshold else 1)
Step 3 — Run it
$ python evals/run.py 0.85
t01 typical OK got=feat expected=feat
t02 typical OK got=fix expected=fix
...
Overall: 0.95
$ echo $?
0
Step 4 — Break the prompt to verify the gate
Edit the subagent and remove the <examples> block. Re-run.
$ python evals/run.py 0.85
...
Overall: 0.65
$ echo $?
1
The gate fires: removing examples broke 6 cases, score dropped to 0.65, exit code 1, CI fails. Restore the block; back to green.
Step 5 — Wire into GitHub Actions
Drop the YAML from the "Regression Gates" section into .github/workflows/prompt-evals.yml. Add ANTHROPIC_API_KEY to repo secrets. Push a PR that touches a prompt file — the eval runs and gates the merge.
A subagent under test, a 20-case dataset, a 30-line runner that gates on a threshold, and CI that blocks regressions. Your prompts are now refactor-safe. Combine this with CC10's prompt caching and you have a stable, observable, production-grade prompt pipeline.
Knowledge Check
1. You're picking a grader for a "is this commit message a feat/fix/chore/docs/refactor?" classifier. Best choice?
2. You're using Claude to generate test cases AND Claude to grade them. Why is this a problem?
3. Your model-based judge gives a score of 4.5 one run, 3.0 the next, on the same input. Why?
4. Your eval suite passes the overall threshold but the "adversarial" tag drops from 0.7 to 0.4. Should you ship?
5. Eval cost on every PR is hurting your CI budget. What's the best mitigation?
Module Summary
- Four-stage workflow: dataset, runner, grader, report.
- Datasets: 20–100 cases, mix of typical / ambiguous / adversarial. Tag every case.
- Code-based grading first — exact match, regex, JSON shape, presence checks. Cheap and deterministic.
- Model-based grading for subjective outputs. Use temp 0, force score via tool use, anchor with reference, calibrate strictness.
- Validate your judge against humans before trusting it — correlate at ≥ 0.8 or fix the rubric.
- Gate with a hard threshold, a regression delta, and a per-tag floor. Sample on PRs, full set on main.
- Wire evals into pre-commit and GitHub Actions. Cost is typically < $0.10 / run for small datasets — cheap insurance.