Eval Framework Builder
An eval framework is the PM's answer to "how do we know this is good enough?" Without one, teams ship on vibes — the model feels right in demos but fails in production on inputs nobody tested. This skill builds a structured, repeatable eval framework before a feature ships, covering what to test, how to score it, and what the bar is for going live.
---
Context
The four layers of a complete eval framework:

| Layer | What it tests | When to run |
|---|---|---|
| Functional evals | Does the output match the task? | Every build |
| Quality evals | Is the output good, not just correct? | Pre-launch + weekly |
| Adversarial evals | Does it fail safely under hostile inputs? | Pre-launch |
| Regression evals | Did a change break something that was working? | Every prompt/model change |
---
Step 1 — Define the eval scope
Establish the basics: what the feature does, the output type, examples of good and bad outputs, expected volume, the audience, and the stakes level.
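The scope can be captured as a simple record so every later step refers to the same facts. A minimal sketch in Python; the field names and sample values are illustrative, not part of the skill:

```python
from dataclasses import dataclass

# Hypothetical scope record; fields mirror the questions in Step 1.
@dataclass
class EvalScope:
    feature: str          # what the feature does
    output_type: str      # e.g. "summary", "classification", "draft email"
    good_examples: list   # known-good outputs
    bad_examples: list    # known-bad outputs
    daily_volume: int     # expected requests per day
    audience: str         # who consumes the output
    stakes: str           # "low" | "medium" | "high" | "critical"

scope = EvalScope(
    feature="Summarize support tickets",
    output_type="summary",
    good_examples=["Concise summary grounded in the ticket text."],
    bad_examples=["Summary that invents an order number."],
    daily_volume=5000,
    audience="support agents",
    stakes="critical",
)
print(scope.stakes)  # → critical
```

The stakes field drives the minimum test-case counts in Step 3, so it is worth pinning down first.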
Step 2 — Define evaluation dimensions
Select the relevant dimensions: accuracy, completeness, groundedness, faithfulness, format compliance, tone/voice, safety, latency, refusal rate. For each, define a scoring method: a scale (binary, 3-point, or 5-point) and a grader (automated or human).
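A dimension registry keeps the scale and grader decisions in one place. A sketch with an assumed subset of dimensions; the scale and method assignments are examples, not recommendations:

```python
# Illustrative dimension registry; scale/method choices are assumptions.
DIMENSIONS = {
    "accuracy":          {"scale": "binary",  "method": "automated"},
    "completeness":      {"scale": "3-point", "method": "human"},
    "groundedness":      {"scale": "5-point", "method": "human"},
    "format_compliance": {"scale": "binary",  "method": "automated"},
    "tone":              {"scale": "3-point", "method": "human"},
    "safety":            {"scale": "binary",  "method": "automated"},
}

# Automated dimensions can run on every build; human ones get sampled.
automated = [d for d, cfg in DIMENSIONS.items() if cfg["method"] == "automated"]
print(automated)  # → ['accuracy', 'format_compliance', 'safety']
```

Splitting dimensions by grader up front makes the cadence in Step 6 easy to wire: automated checks run continuously, human reviews run on a schedule.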
Step 3 — Build the test case set
Set minimum test-case counts by stakes level (e.g., Critical requires 20+ functional, 10+ quality, and 10+ adversarial cases). Cover all four categories: functional, quality, adversarial, and regression.
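A quick check that the test set meets the floor for its stakes level can run in CI. A minimal sketch; only the Critical numbers come from this skill, the "high" row is an assumed example:

```python
# Minimum case counts per stakes level. "critical" is from Step 3;
# "high" is an assumed illustration.
MINIMUMS = {
    "critical": {"functional": 20, "quality": 10, "adversarial": 10},
    "high":     {"functional": 10, "quality": 5,  "adversarial": 5},
}

def missing_cases(stakes, counts):
    """Return each category that falls short of the minimum, and by how many."""
    floor = MINIMUMS[stakes]
    return {cat: need - counts.get(cat, 0)
            for cat, need in floor.items()
            if counts.get(cat, 0) < need}

gaps = missing_cases("critical", {"functional": 25, "quality": 8, "adversarial": 10})
print(gaps)  # → {'quality': 2}
```

An empty result means the set meets the floor; anything else names exactly which category needs more cases before the eval is meaningful.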
Step 4 — Define the scoring rubric
For each scale-rated dimension, write an explicit description and an example output for every score level, plus guidance for borderline cases.
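A rubric is just a mapping from score to description that every grader shares. A sketch for a hypothetical 3-point groundedness rubric; the level wording and borderline rule are illustrative:

```python
# Illustrative 3-point rubric for "groundedness"; wording is an assumption.
GROUNDEDNESS_RUBRIC = {
    3: "Every claim is traceable to the source; nothing added.",
    2: "One minor unsupported detail that does not change the meaning.",
    1: "Contains fabricated facts or contradicts the source.",
}
BORDERLINE = "If torn between 2 and 3, score 2 and flag for a second review."

def describe(score):
    """Look up the shared description for a score so graders stay calibrated."""
    return GROUNDEDNESS_RUBRIC[score]

print(describe(1))
```

Writing the borderline rule down matters as much as the levels themselves: it is where independent graders diverge most.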
Step 5 — Set the ship threshold
Define a minimum score for each dimension and mark which dimensions are blockers. Every blocker dimension must meet its threshold before the feature ships.
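The ship gate reduces to a single pass/fail decision over blocker dimensions. A minimal sketch; the dimension names, thresholds, and blocker flags are assumed examples:

```python
# Hypothetical ship gate. dimension -> (minimum mean score, is_blocker).
THRESHOLDS = {
    "accuracy":     (0.95, True),
    "groundedness": (4.0,  True),
    "tone":         (3.5,  False),  # tracked, but not ship-blocking
}

def can_ship(results):
    """results: dimension -> observed mean score. Only blockers gate launch."""
    failures = [dim for dim, (floor, blocker) in THRESHOLDS.items()
                if blocker and results.get(dim, 0) < floor]
    return len(failures) == 0, failures

ok, failures = can_ship({"accuracy": 0.97, "groundedness": 3.8, "tone": 4.0})
print(ok, failures)  # → False ['groundedness']
```

Non-blocker dimensions still get measured and reported; they just cannot hold the release on their own.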
Step 6 — Define the eval cadence
- Pre-launch: full framework.
- Weekly: quality eval on a sample.
- On prompt change: functional + regression (blocking).
- On model upgrade: full re-run.
- On incident: full re-run plus root-cause analysis.
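The cadence above maps cleanly to a trigger table that CI or a scheduler can read. A sketch; the trigger keys and suite names are illustrative labels for this skill's categories:

```python
FULL = ["functional", "quality", "adversarial", "regression"]

# Trigger -> which suites run, and whether failure blocks the change.
CADENCE = {
    "pre_launch":    {"suites": FULL,                         "blocking": True},
    "weekly":        {"suites": ["quality"],                  "blocking": False},
    "prompt_change": {"suites": ["functional", "regression"], "blocking": True},
    "model_upgrade": {"suites": FULL,                         "blocking": True},
    "incident":      {"suites": FULL,                         "blocking": True},
}

def suites_for(trigger):
    """Return the suites to run and whether a failure blocks the change."""
    entry = CADENCE[trigger]
    return entry["suites"], entry["blocking"]

print(suites_for("prompt_change"))  # → (['functional', 'regression'], True)
```

Encoding the cadence this way keeps "what runs when" reviewable in one diff instead of scattered across pipeline configs.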