
Human-in-the-Loop Eval Design

Automated evals catch format failures and binary errors. Human evaluators catch the things machines can't measure: tone that's technically correct but off-brand, reasoning that sounds right but is subtly wrong, outputs that satisfy the spec but miss the user's actual need. This skill designs the workflow, rubric, and feedback loop for human evaluation of AI outputs.

---

Context

When human eval is required (not optional):
  • Outputs that make factual claims users will act on
  • Outputs involving sensitive topics (health, legal, financial, safety)
  • Outputs where tone and trust are central to the product experience
  • Any feature where automated eval passes but user satisfaction is low
The three human eval roles:

| Role | Who | What they judge |
| --- | --- | --- |
| Domain expert | Subject matter expert | Factual accuracy, domain correctness |
| Product reviewer | PM or product team member | Alignment with product intent and voice |
| User proxy | Someone who represents the end user | Usability, clarity, trust, natural language quality |

---

Step 1 — Define the eval scope

Ask: which outputs need review, what the highest-risk failure looks like, the volume and stakes involved, who is available to review, and how findings will change the product.

Step 2 — Design the sampling strategy

Review every output when volume is low, the feature is in beta, or a change was just deployed. In steady state, switch to sample-based review: minimum sample = √(weekly volume), doubled for critical features.
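As a minimal sketch of the sampling rule above (the function names and the uniform-random selection are assumptions, not something this skill prescribes):

```python
import math
import random

def sample_size(weekly_volume: int, critical: bool = False) -> int:
    """Minimum weekly review sample: sqrt(volume), doubled for critical features."""
    n = math.ceil(math.sqrt(weekly_volume))
    if critical:
        n *= 2
    return min(n, weekly_volume)  # never ask for more items than exist

def pick_sample(outputs: list, critical: bool = False, seed: int = 0) -> list:
    """Uniform random sample of this week's outputs for human review."""
    rng = random.Random(seed)  # fixed seed makes the weekly draw reproducible
    return rng.sample(outputs, sample_size(len(outputs), critical))
```

For 100 outputs a week this yields a sample of 10, or 20 for a critical feature.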

Step 3 — Write the evaluation rubric

Use a three-point scale per dimension (Pass / Borderline / Fail), with specific descriptions and examples at each level. Require reviewer calibration: score 10 pre-selected outputs together before anyone scores independently.
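The calibration session needs a concrete agreement check. A simple sketch using raw percent agreement between two reviewers (an assumption; a stricter team might prefer Cohen's kappa):

```python
def agreement(scores_a: list, scores_b: list) -> float:
    """Fraction of calibration items where two reviewers gave the same score.
    Scores are strings on the rubric scale: 'pass' / 'borderline' / 'fail'."""
    assert len(scores_a) == len(scores_b)
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

def disagreements(scores_a: list, scores_b: list) -> list:
    """Indices of items to discuss in the calibration session."""
    return [i for i, (a, b) in enumerate(zip(scores_a, scores_b)) if a != b]
```

Every index returned by `disagreements` is a candidate rubric gap: if two calibrated reviewers read the same level differently, the level description needs a sharper example.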

Step 4 — Build the review interface

Each reviewer sees: the original input, the AI output, the source document (if applicable), the rubric, and the scoring form. Minimum viable tooling: Airtable, Google Sheets, or a purpose-built tool.
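Whatever the tooling, the underlying record is the same. A hypothetical schema sketch (every field name here is illustrative, not prescribed):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReviewItem:
    """One row in the review queue: everything the reviewer sees."""
    item_id: str
    original_input: str
    ai_output: str
    source_document: Optional[str] = None  # only for grounded outputs

@dataclass
class ReviewScore:
    """One reviewer's completed scoring form for one item."""
    item_id: str
    reviewer: str
    scores: dict = field(default_factory=dict)  # dimension -> pass/borderline/fail
    notes: str = ""
```

Keeping `ReviewItem` and `ReviewScore` separate lets multiple reviewers score the same item independently, which the calibration step requires.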

Step 5 — Define the escalation protocol

A fail on the safety dimension → immediate escalation, no waiting for the weekly review. Overall fail rate above threshold → prompt review required. Sustained high fail rate → consider rolling back the feature.
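A sketch of the escalation decision; the 10% and 25% thresholds are placeholder assumptions, not values this skill prescribes:

```python
def escalation_action(dimension_fails: dict, total_reviewed: int,
                      review_threshold: float = 0.10,
                      rollback_threshold: float = 0.25) -> str:
    """Map one review period's fail counts (dimension -> count) to an action."""
    # Safety gets the most aggressive path: a single fail escalates immediately.
    if dimension_fails.get("safety", 0) > 0:
        return "escalate_immediately"
    fail_rate = sum(dimension_fails.values()) / total_reviewed
    if fail_rate >= rollback_threshold:
        return "consider_rollback"
    if fail_rate >= review_threshold:
        return "prompt_review"
    return "no_action"
```

Note the ordering: the safety check runs before any rate math, so a low overall fail rate can never mask a safety failure.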

Step 6 — Build the feedback loop

Weekly: findings feed prompt improvement (3+ failures with the same root cause = prompt update). Monthly: findings feed product decisions (persistent patterns become engineering tickets).
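The weekly "3+ failures with the same root cause" rule can be sketched as follows (the `root_cause` label is assumed to be tagged by reviewers during scoring):

```python
from collections import Counter

def prompt_update_candidates(failures: list, min_count: int = 3) -> list:
    """Root causes seen on 3+ failures this week, i.e. candidates for a prompt update.
    Each failure is a dict carrying a reviewer-assigned 'root_cause' label."""
    counts = Counter(f["root_cause"] for f in failures)
    return [cause for cause, n in counts.items() if n >= min_count]
```

Root causes that recur week after week without clearing the bar are the "persistent patterns" that the monthly review should turn into engineering tickets.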

Quality check before delivering

  • Sampling strategy includes a numeric sample size
  • Rubric includes examples at each score level
  • Reviewer calibration process is defined
  • Escalation protocol has specific triggers, owners, and timeframes
  • Feedback loop connects findings to prompt changes AND product decisions
  • Safety failures have the most aggressive escalation path

Suggested next step: Before recruiting reviewers, complete the rubric calibration session yourself — score 10 outputs and write down your reasoning for each. This will reveal rubric gaps before a second reviewer finds them.