Human-in-the-Loop Eval Design
Automated evals catch format failures and binary errors. Human evaluators catch the things machines can't measure: tone that's technically correct but off-brand, reasoning that sounds right but is subtly wrong, outputs that satisfy the spec but miss the user's actual need. This skill designs the workflow, rubric, and feedback loop for human evaluation of AI outputs.
---
Context
When human eval is required (not optional), these reviewer roles apply:
| Role | Who | What they judge |
|---|---|---|
| Domain expert | Subject matter expert | Factual accuracy, domain correctness |
| Product reviewer | PM or product team member | Alignment with product intent and voice |
| User proxy | Someone who represents the end user | Usability, clarity, trust, natural language quality |
---
Step 1 — Define the eval scope
Ask: which outputs need review, what the highest-risk failure mode is, the expected volume, the stakes, who is available to review, and how findings will change the product.
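The answers to these scoping questions can be captured in a single record so they travel with the eval. A minimal sketch; the field names and example values here are illustrative, not prescribed by this skill:

```python
from dataclasses import dataclass

@dataclass
class EvalScope:
    outputs_reviewed: str       # which outputs need review
    highest_risk_failure: str   # the failure mode that matters most
    weekly_volume: int          # expected outputs per week
    stakes: str                 # e.g. "user-facing" vs "internal"
    reviewers: list             # who is available to review
    findings_route: str         # how findings change the product

scope = EvalScope(
    outputs_reviewed="support replies",
    highest_risk_failure="wrong refund amount stated",
    weekly_volume=400,
    stakes="user-facing",
    reviewers=["domain expert", "product reviewer", "user proxy"],
    findings_route="weekly prompt review",
)
```

Writing the scope down as data, rather than in a doc nobody reopens, makes it easy to check later whether the review program still matches the feature it was designed for.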
Step 2 — Design the sampling strategy
Review all outputs when volume is low, the feature is in beta, or a change was just deployed. In steady state, use sample-based review: minimum sample = √(weekly volume), doubled for critical features.
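The steady-state sampling rule above can be sketched directly; the cap at total volume is an added safety assumption so small-volume critical features never "sample" more outputs than exist:

```python
import math
import random

def sample_size(weekly_volume: int, critical: bool = False) -> int:
    """Minimum review sample: sqrt(weekly volume), doubled for critical features."""
    n = math.ceil(math.sqrt(weekly_volume))
    if critical:
        n *= 2
    return min(weekly_volume, n)  # never exceed the actual volume

def draw_sample(outputs: list, critical: bool = False) -> list:
    """Draw a uniform random sample of the week's outputs for human review."""
    k = sample_size(len(outputs), critical)
    return random.sample(outputs, k)

print(sample_size(400))                 # → 20
print(sample_size(400, critical=True))  # → 40
```

Uniform random sampling is the simplest defensible default; if certain input types are known to be riskier, stratified sampling by input type is a natural extension.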
Step 3 — Write the evaluation rubric
Use a three-point scale per dimension (Pass/Borderline/Fail), with specific descriptions and examples at each level. Require reviewer calibration: all reviewers score 10 pre-selected outputs together before scoring independently.
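Calibration is only useful if you measure whether reviewers actually agree. A minimal sketch of exact-match agreement over the calibration set; the scores shown are hypothetical, and the 0.8 agreement bar is an illustrative assumption, not a number from this skill:

```python
LEVELS = ("pass", "borderline", "fail")

def agreement_rate(scores_a: list, scores_b: list) -> float:
    """Fraction of calibration outputs where two reviewers gave the same score."""
    assert len(scores_a) == len(scores_b), "reviewers must score the same outputs"
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Hypothetical calibration scores from two reviewers on five outputs:
reviewer_a = ["pass", "pass", "fail", "borderline", "pass"]
reviewer_b = ["pass", "fail", "fail", "borderline", "pass"]
rate = agreement_rate(reviewer_a, reviewer_b)  # 4 of 5 match
```

If agreement stays low after discussion, the fix is usually sharper rubric descriptions at the Borderline level, not more calibration rounds.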
Step 4 — Build the review interface
The reviewer sees: the original input, the AI output, the source document (if applicable), the rubric, and the scoring form. Minimum viable tooling: Airtable, Google Sheets, or a purpose-built tool.
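Whatever the tooling, each review reduces to one record with those fields, and it pays to validate records before accepting them. A minimal sketch; the field and dimension names are illustrative assumptions:

```python
ALLOWED_SCORES = {"pass", "borderline", "fail"}

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is ready to submit."""
    problems = []
    # Required fields: the input shown, the output judged, scores, and who judged it.
    for field in ("input", "output", "scores", "reviewer"):
        if not record.get(field):
            problems.append(f"missing {field}")
    # Every scored dimension must use the three-point scale.
    for dim, score in record.get("scores", {}).items():
        if score not in ALLOWED_SCORES:
            problems.append(f"invalid score {score!r} for {dim}")
    return problems

record = {
    "input": "Can I get a refund after 30 days?",
    "output": "Yes, within 60 days of purchase...",
    "source_doc": "refund-policy-v3",   # if applicable
    "scores": {"accuracy": "fail", "tone": "pass", "safety": "pass"},
    "notes": "States 60-day window; policy says 30.",
    "reviewer": "r1",
}
```

Even in a spreadsheet, data validation on the score columns prevents the free-text drift ("ok", "fine", "mostly pass") that makes results impossible to aggregate.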
Step 5 — Define the escalation protocol
A fail on the safety dimension → immediate escalation. Fail rate above threshold → prompt review required. Sustained high fail rate → consider feature rollback.
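The escalation rules above can be encoded so they fire mechanically from the weekly review tallies rather than depending on someone noticing. The threshold values here are illustrative assumptions; set your own per feature:

```python
def escalation_actions(dimension_fails: dict, fail_rate: float,
                       prompt_review_threshold: float = 0.10,
                       rollback_threshold: float = 0.25) -> list:
    """Map a week's review results to the escalation actions they trigger."""
    actions = []
    # Any safety fail escalates immediately, regardless of overall rate.
    if dimension_fails.get("safety", 0) > 0:
        actions.append("immediate escalation: safety fail")
    # Overall fail rate past the threshold forces a prompt review.
    if fail_rate > prompt_review_threshold:
        actions.append("prompt review required")
    # A much higher rate puts rollback on the table.
    if fail_rate > rollback_threshold:
        actions.append("consider feature rollback")
    return actions

# e.g. one safety fail and a 30% overall fail rate trigger all three actions
actions = escalation_actions({"safety": 1, "tone": 4}, fail_rate=0.30)
```

Keeping the thresholds as named parameters makes them reviewable alongside the rubric, instead of living in someone's head.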
Step 6 — Build the feedback loop
Weekly findings → prompt improvement (3+ failures with the same root cause = prompt update). Monthly findings → product decisions (persistent patterns → engineering ticket).
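The "3+ failures with the same root cause" rule is a simple aggregation over the week's failure records. A minimal sketch, assuming each failure record carries a `root_cause` tag assigned by the reviewer:

```python
from collections import Counter

def prompt_update_candidates(failures: list, min_count: int = 3) -> list:
    """Root causes tagged on 3+ failures this week, flagged for a prompt update."""
    counts = Counter(f["root_cause"] for f in failures)
    return [cause for cause, n in counts.items() if n >= min_count]

# Hypothetical week of tagged failures:
week = [
    {"root_cause": "missing policy context"},
    {"root_cause": "missing policy context"},
    {"root_cause": "missing policy context"},
    {"root_cause": "overly formal tone"},
]
flagged = prompt_update_candidates(week)
```

The same counts, accumulated over a month, surface the persistent patterns that should become engineering tickets rather than another prompt tweak.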