Eval Framework Builder

An eval framework is the PM's answer to "how do we know this is good enough?" Without one, teams ship on vibes — the model feels right in demos but fails in production on inputs nobody tested. This skill builds a structured, repeatable eval framework before a feature ships, covering what to test, how to score it, and what the bar is for going live.

---

Context

The four layers of a complete eval framework:
| Layer | What it tests | When to run |
| --- | --- | --- |
| Functional evals | Does the output match the task? | Every build |
| Quality evals | Is the output good, not just correct? | Pre-launch + weekly |
| Adversarial evals | Does it fail safely under hostile inputs? | Pre-launch |
| Regression evals | Did a change break something that was working? | Every prompt/model change |

---

Step 1 — Define the eval scope

Ask: what the feature does, the output type, examples of good and bad outputs, expected volume, the audience, and the stakes level.
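The scope answers can be captured as a structured record so later steps (minimum counts, thresholds) can key off them. This is a minimal sketch; the field names and example values are illustrative, not a standard schema.

```python
from dataclasses import dataclass

# Illustrative scope record; field names are assumptions, not a fixed schema.
@dataclass
class EvalScope:
    feature: str         # what the feature does
    output_type: str     # e.g. "summary", "classification", "free text"
    good_example: str    # one known-good output
    bad_example: str     # one known-bad output
    monthly_volume: int  # expected request volume
    audience: str        # internal vs. external users
    stakes: str          # "low", "medium", or "critical"

scope = EvalScope(
    feature="Summarize support tickets",
    output_type="summary",
    good_example="Customer reports login failure after password reset.",
    bad_example="The customer is happy.",  # hallucinated sentiment
    monthly_volume=50_000,
    audience="internal support agents",
    stakes="critical",
)
```

The `stakes` field matters most: it drives the minimum test-case counts in Step 3.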

Step 2 — Define evaluation dimensions

Select relevant dimensions: accuracy, completeness, groundedness, faithfulness, format compliance, tone/voice, safety, latency, refusal rate. Define a scoring method for each: a scale (binary, 3-point, or 5-point) and a rater (automated or human).
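A dimension registry keeps the scale and rater decisions in one place, and lets you separate what can run automatically in CI from what needs a human pass. The specific scale/rater assignments below are hypothetical examples, not recommendations.

```python
# Hypothetical dimension registry; each entry pairs a scale with a rater.
DIMENSIONS = {
    "accuracy":          {"scale": "5-point", "scored_by": "human"},
    "groundedness":      {"scale": "binary",  "scored_by": "automated"},
    "format_compliance": {"scale": "binary",  "scored_by": "automated"},
    "tone":              {"scale": "3-point", "scored_by": "human"},
    "latency":           {"scale": "binary",  "scored_by": "automated"},
}

def automated_dimensions(dims):
    """Return the dimensions that can run in CI without a human rater."""
    return sorted(k for k, v in dims.items() if v["scored_by"] == "automated")
```

`automated_dimensions(DIMENSIONS)` here returns `["format_compliance", "groundedness", "latency"]`, which is the subset you can wire into every build.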

Step 3 — Build the test case set

Set minimum test-case counts by stakes level (for critical features: 20+ functional, 10+ quality, 10+ adversarial). Cover four categories: functional, quality, adversarial, and regression.
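The minimums can be enforced with a small gate before the eval run. The "critical" row comes from the text above; the lower tiers are illustrative assumptions.

```python
# Minimum test-case counts per category, keyed by stakes level.
# "critical" matches the numbers in the text; other tiers are assumed.
MINIMUMS = {
    "critical": {"functional": 20, "quality": 10, "adversarial": 10},
    "medium":   {"functional": 10, "quality": 5,  "adversarial": 5},
    "low":      {"functional": 5,  "quality": 3,  "adversarial": 2},
}

def missing_cases(stakes, counts):
    """Return {category: shortfall} for any category below the minimum."""
    floor = MINIMUMS[stakes]
    return {cat: need - counts.get(cat, 0)
            for cat, need in floor.items()
            if counts.get(cat, 0) < need}
```

For example, a critical feature with 20 functional, 8 quality, and 10 adversarial cases reports a shortfall of 2 quality cases.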

Step 4 — Define the scoring rubric

For each scale-rated dimension: write explicit descriptions and examples for each score level. Include borderline guidance.
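A rubric is just a score level mapped to an explicit description plus an anchor example, so two raters score the same output the same way. This sketch uses a hypothetical 3-point groundedness rubric; the descriptions and examples are illustrative.

```python
# Hypothetical 3-point groundedness rubric: each level gets an explicit
# description and an anchor example, per the guidance above.
GROUNDEDNESS_RUBRIC = {
    3: {"desc": "Every claim traces to the source material.",
        "example": "Summary cites only facts present in the ticket."},
    2: {"desc": "Minor unsupported detail; core claims grounded.",
        "example": "Adds a plausible but unstated product name."},
    1: {"desc": "Contains fabricated or contradicted claims.",
        "example": "States a refund was issued when none was."},
}

def borderline_guidance(rubric, low, high):
    """Spell out the boundary between two adjacent score levels."""
    return (f"Score {low} vs {high}: "
            f"{rubric[low]['desc']} / {rubric[high]['desc']}")
```

The borderline guidance between levels 2 and 3 is where most rater disagreement happens, so write it down rather than leaving it to judgment.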

Step 5 — Set the ship threshold

Define minimum score per dimension with blocker status. All blocker dimensions must meet threshold before shipping.
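The ship decision can be mechanical: every blocker dimension must meet its minimum, while non-blocker dimensions inform but do not gate. The threshold values below are hypothetical placeholders, not recommended bars.

```python
# Per-dimension ship thresholds; values are illustrative placeholders.
# "blocker": True means the dimension gates the launch.
THRESHOLDS = {
    "accuracy":          {"min": 4.0,  "blocker": True},
    "groundedness":      {"min": 0.98, "blocker": True},
    "format_compliance": {"min": 0.95, "blocker": True},
    "tone":              {"min": 2.5,  "blocker": False},
}

def can_ship(scores):
    """Ship only if every blocker dimension meets its minimum score."""
    return all(scores.get(dim, 0) >= rule["min"]
               for dim, rule in THRESHOLDS.items()
               if rule["blocker"])
```

Note that a low tone score does not block here, but a single blocker miss (say groundedness at 0.90) fails the whole gate.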

Step 6 — Define the eval cadence

Pre-launch: full framework.
Weekly: quality eval sample.
On prompt change: functional + regression (blocking).
On model upgrade: full re-run.
On incident: full re-run + root cause.
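The cadence above maps cleanly to a trigger-to-suites table that CI or a scheduler can read. This is a minimal sketch of that mapping; the trigger and suite names are assumptions.

```python
# Trigger -> eval suites to run, mirroring the cadence above.
# "blocking" means a failure stops the change from shipping.
FULL = ["functional", "quality", "adversarial", "regression"]

CADENCE = {
    "pre_launch":    {"suites": FULL,                        "blocking": True},
    "weekly":        {"suites": ["quality_sample"],          "blocking": False},
    "prompt_change": {"suites": ["functional", "regression"],"blocking": True},
    "model_upgrade": {"suites": FULL,                        "blocking": True},
    "incident":      {"suites": FULL + ["root_cause"],       "blocking": True},
}

def suites_for(trigger):
    """Return the eval suites a given trigger should run."""
    return CADENCE[trigger]["suites"]
```

A prompt change, for instance, runs only functional and regression suites but blocks the merge on failure, while the weekly quality sample is informational.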

Quality check before delivering

- Evaluation dimensions are specific to this feature
- Test cases include adversarial inputs
- Ship threshold is defined with blockers identified
- Scoring rubrics have examples for each score level
- Eval cadence covers post-launch, not just pre-launch
- Regression case protocol is defined
Suggested next step: Run the framework now — before engineering finishes — so you know what the bar is. The worst time to define "good enough" is when engineering asks "are we done?"