Eval Framework Builder

An eval framework is the PM's answer to "how do we know this is good enough?" Without one, teams ship on vibes — the model feels right in demos but fails in production on inputs nobody tested. This skill builds a structured, repeatable eval framework before a feature ships, covering what to test, how to score it, and what the bar is for going live.

---

Context

The four layers of a complete eval framework:
| Layer | What it tests | When to run |
| --- | --- | --- |
| Functional evals | Does the output match the task? | Every build |
| Quality evals | Is the output good, not just correct? | Pre-launch + weekly |
| Adversarial evals | Does it fail safely under hostile inputs? | Pre-launch |
| Regression evals | Did a change break something that was working? | Every prompt/model change |

---

Step 1 — Define the eval scope

Ask: what the feature does, the output type, examples of good and bad outputs, expected volume, the audience, and the stakes level.
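The scope answers can be captured as a structured record so later steps (minimum counts, thresholds) can key off them. This is a minimal sketch; the field names and example values are illustrative, not a standard schema.

```python
from dataclasses import dataclass

# Illustrative scope record; field names are assumptions, not a fixed schema.
@dataclass
class EvalScope:
    feature: str         # what the feature does
    output_type: str     # e.g. "summary", "classification", "free text"
    good_example: str    # one known-good output
    bad_example: str     # one known-bad output
    monthly_volume: int  # expected request volume
    audience: str        # internal vs. external users
    stakes: str          # "low", "medium", or "critical"

scope = EvalScope(
    feature="Summarize support tickets",
    output_type="summary",
    good_example="Customer reports login failure after password reset.",
    bad_example="The customer is happy.",  # hallucinated sentiment
    monthly_volume=50_000,
    audience="internal support agents",
    stakes="critical",
)
```

The `stakes` field matters most: it drives the minimum test-case counts in Step 3.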

Step 2 — Define evaluation dimensions

Select relevant dimensions: accuracy, completeness, groundedness, faithfulness, format compliance, tone/voice, safety, latency, refusal rate. Define a scoring method for each: a scale (binary, 3-point, or 5-point) and a rater (automated or human).
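A dimension registry keeps the scale and rater decisions in one place, and lets you separate what can run automatically in CI from what needs a human pass. The specific scale/rater assignments below are hypothetical examples, not recommendations.

```python
# Hypothetical dimension registry; each entry pairs a scale with a rater.
DIMENSIONS = {
    "accuracy":          {"scale": "5-point", "scored_by": "human"},
    "groundedness":      {"scale": "binary",  "scored_by": "automated"},
    "format_compliance": {"scale": "binary",  "scored_by": "automated"},
    "tone":              {"scale": "3-point", "scored_by": "human"},
    "latency":           {"scale": "binary",  "scored_by": "automated"},
}

def automated_dimensions(dims):
    """Return the dimensions that can run in CI without a human rater."""
    return sorted(k for k, v in dims.items() if v["scored_by"] == "automated")
```

`automated_dimensions(DIMENSIONS)` here returns `["format_compliance", "groundedness", "latency"]`, which is the subset you can wire into every build.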

Step 3 — Build the test case set

Set minimum test-case counts by stakes level (for critical features: 20+ functional, 10+ quality, 10+ adversarial). Cover four categories: functional, quality, adversarial, and regression.
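The minimums can be enforced with a small gate before the eval run. The "critical" row comes from the text above; the lower tiers are illustrative assumptions.

```python
# Minimum test-case counts per category, keyed by stakes level.
# "critical" matches the numbers in the text; other tiers are assumed.
MINIMUMS = {
    "critical": {"functional": 20, "quality": 10, "adversarial": 10},
    "medium":   {"functional": 10, "quality": 5,  "adversarial": 5},
    "low":      {"functional": 5,  "quality": 3,  "adversarial": 2},
}

def missing_cases(stakes, counts):
    """Return {category: shortfall} for any category below the minimum."""
    floor = MINIMUMS[stakes]
    return {cat: need - counts.get(cat, 0)
            for cat, need in floor.items()
            if counts.get(cat, 0) < need}
```

For example, a critical feature with 20 functional, 8 quality, and 10 adversarial cases reports a shortfall of 2 quality cases.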

Step 4 — Define the scoring rubric

For each scale-rated dimension: write explicit descriptions and examples for each score level. Include borderline guidance.
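A rubric is just a score level mapped to an explicit description plus an anchor example, so two raters score the same output the same way. This sketch uses a hypothetical 3-point groundedness rubric; the descriptions and examples are illustrative.

```python
# Hypothetical 3-point groundedness rubric: each level gets an explicit
# description and an anchor example, per the guidance above.
GROUNDEDNESS_RUBRIC = {
    3: {"desc": "Every claim traces to the source material.",
        "example": "Summary cites only facts present in the ticket."},
    2: {"desc": "Minor unsupported detail; core claims grounded.",
        "example": "Adds a plausible but unstated product name."},
    1: {"desc": "Contains fabricated or contradicted claims.",
        "example": "States a refund was issued when none was."},
}

def borderline_guidance(rubric, low, high):
    """Spell out the boundary between two adjacent score levels."""
    return (f"Score {low} vs {high}: "
            f"{rubric[low]['desc']} / {rubric[high]['desc']}")
```

The borderline guidance between levels 2 and 3 is where most rater disagreement happens, so write it down rather than leaving it to judgment.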

Step 5 — Set the ship threshold

Define minimum score per dimension with blocker status. All blocker dimensions must meet threshold before shipping.
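The ship decision can be mechanical: every blocker dimension must meet its minimum, while non-blocker dimensions inform but do not gate. The threshold values below are hypothetical placeholders, not recommended bars.

```python
# Per-dimension ship thresholds; values are illustrative placeholders.
# "blocker": True means the dimension gates the launch.
THRESHOLDS = {
    "accuracy":          {"min": 4.0,  "blocker": True},
    "groundedness":      {"min": 0.98, "blocker": True},
    "format_compliance": {"min": 0.95, "blocker": True},
    "tone":              {"min": 2.5,  "blocker": False},
}

def can_ship(scores):
    """Ship only if every blocker dimension meets its minimum score."""
    return all(scores.get(dim, 0) >= rule["min"]
               for dim, rule in THRESHOLDS.items()
               if rule["blocker"])
```

Note that a low tone score does not block here, but a single blocker miss (say groundedness at 0.90) fails the whole gate.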

Step 6 — Define the eval cadence

Pre-launch: full framework.
Weekly: quality eval sample.
On prompt change: functional + regression (blocking).
On model upgrade: full re-run.
On incident: full re-run + root cause.
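The cadence above maps cleanly to a trigger-to-suites table that CI or a scheduler can read. This is a minimal sketch of that mapping; the trigger and suite names are assumptions.

```python
# Trigger -> eval suites to run, mirroring the cadence above.
# "blocking" means a failure stops the change from shipping.
FULL = ["functional", "quality", "adversarial", "regression"]

CADENCE = {
    "pre_launch":    {"suites": FULL,                        "blocking": True},
    "weekly":        {"suites": ["quality_sample"],          "blocking": False},
    "prompt_change": {"suites": ["functional", "regression"],"blocking": True},
    "model_upgrade": {"suites": FULL,                        "blocking": True},
    "incident":      {"suites": FULL + ["root_cause"],       "blocking": True},
}

def suites_for(trigger):
    """Return the eval suites a given trigger should run."""
    return CADENCE[trigger]["suites"]
```

A prompt change, for instance, runs only functional and regression suites but blocks the merge on failure, while the weekly quality sample is informational.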

Quality check before delivering

- Evaluation dimensions are specific to this feature
- Test cases include adversarial inputs
- Ship threshold is defined with blockers identified
- Scoring rubrics have examples for each score level
- Eval cadence covers post-launch, not just pre-launch
- Regression case protocol is defined
Suggested next step: Run the framework now — before engineering finishes — so you know what the bar is. The worst time to define "good enough" is when engineering asks "are we done?"