
Human-in-the-Loop Eval Design

Automated evals catch format failures and binary errors. Human evaluators catch the things machines can't measure: tone that's technically correct but off-brand, reasoning that sounds right but is subtly wrong, outputs that satisfy the spec but miss the user's actual need. This skill designs the workflow, rubric, and feedback loop for human evaluation of AI outputs.

---

Context

When human eval is required (not optional):
  • Outputs that make factual claims users will act on
  • Outputs involving sensitive topics (health, legal, financial, safety)
  • Outputs where tone and trust are central to the product experience
  • Any feature where automated eval passes but user satisfaction is low
The three human eval roles:

| Role | Who | What they judge |
| --- | --- | --- |
| Domain expert | Subject matter expert | Factual accuracy, domain correctness |
| Product reviewer | PM or product team member | Alignment with product intent and voice |
| User proxy | Someone who represents the end user | Usability, clarity, trust, natural language quality |

---

Step 1 — Define the eval scope

Ask: which outputs need review, what the highest-risk failure looks like, the volume and stakes involved, who is available to review, and how findings will change the product.

Step 2 — Design the sampling strategy

Review every output when volume is low, the feature is in beta, or a change was just deployed. In steady state, switch to sample-based review: minimum sample = √(weekly volume), doubled for critical features.
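As a minimal sketch of the sampling rule above (the function names and the uniform-random selection are assumptions, not something this skill prescribes):

```python
import math
import random

def sample_size(weekly_volume: int, critical: bool = False) -> int:
    """Minimum weekly review sample: sqrt(volume), doubled for critical features."""
    n = math.ceil(math.sqrt(weekly_volume))
    if critical:
        n *= 2
    return min(n, weekly_volume)  # never ask for more items than exist

def pick_sample(outputs: list, critical: bool = False, seed: int = 0) -> list:
    """Uniform random sample of this week's outputs for human review."""
    rng = random.Random(seed)  # fixed seed makes the weekly draw reproducible
    return rng.sample(outputs, sample_size(len(outputs), critical))
```

For 100 outputs a week this yields a sample of 10, or 20 for a critical feature.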

Step 3 — Write the evaluation rubric

Use a three-point scale per dimension (Pass / Borderline / Fail), with specific descriptions and examples at each level. Require reviewer calibration: score 10 pre-selected outputs together before anyone scores independently.
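The calibration session needs a concrete agreement check. A simple sketch using raw percent agreement between two reviewers (an assumption; a stricter team might prefer Cohen's kappa):

```python
def agreement(scores_a: list, scores_b: list) -> float:
    """Fraction of calibration items where two reviewers gave the same score.
    Scores are strings on the rubric scale: 'pass' / 'borderline' / 'fail'."""
    assert len(scores_a) == len(scores_b)
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

def disagreements(scores_a: list, scores_b: list) -> list:
    """Indices of items to discuss in the calibration session."""
    return [i for i, (a, b) in enumerate(zip(scores_a, scores_b)) if a != b]
```

Every index returned by `disagreements` is a candidate rubric gap: if two calibrated reviewers read the same level differently, the level description needs a sharper example.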

Step 4 — Build the review interface

Each reviewer sees: the original input, the AI output, the source document (if applicable), the rubric, and the scoring form. Minimum viable tooling: Airtable, Google Sheets, or a purpose-built tool.
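Whatever the tooling, the underlying record is the same. A hypothetical schema sketch (every field name here is illustrative, not prescribed):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReviewItem:
    """One row in the review queue: everything the reviewer sees."""
    item_id: str
    original_input: str
    ai_output: str
    source_document: Optional[str] = None  # only for grounded outputs

@dataclass
class ReviewScore:
    """One reviewer's completed scoring form for one item."""
    item_id: str
    reviewer: str
    scores: dict = field(default_factory=dict)  # dimension -> pass/borderline/fail
    notes: str = ""
```

Keeping `ReviewItem` and `ReviewScore` separate lets multiple reviewers score the same item independently, which the calibration step requires.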

Step 5 — Define the escalation protocol

A fail on the safety dimension → immediate escalation, no waiting for the weekly review. Overall fail rate above threshold → prompt review required. Sustained high fail rate → consider rolling back the feature.
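A sketch of the escalation decision; the 10% and 25% thresholds are placeholder assumptions, not values this skill prescribes:

```python
def escalation_action(dimension_fails: dict, total_reviewed: int,
                      review_threshold: float = 0.10,
                      rollback_threshold: float = 0.25) -> str:
    """Map one review period's fail counts (dimension -> count) to an action."""
    # Safety gets the most aggressive path: a single fail escalates immediately.
    if dimension_fails.get("safety", 0) > 0:
        return "escalate_immediately"
    fail_rate = sum(dimension_fails.values()) / total_reviewed
    if fail_rate >= rollback_threshold:
        return "consider_rollback"
    if fail_rate >= review_threshold:
        return "prompt_review"
    return "no_action"
```

Note the ordering: the safety check runs before any rate math, so a low overall fail rate can never mask a safety failure.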

Step 6 — Build the feedback loop

Weekly: findings feed prompt improvement (3+ failures with the same root cause = prompt update). Monthly: findings feed product decisions (persistent patterns become engineering tickets).
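The weekly "3+ failures with the same root cause" rule can be sketched as follows (the `root_cause` label is assumed to be tagged by reviewers during scoring):

```python
from collections import Counter

def prompt_update_candidates(failures: list, min_count: int = 3) -> list:
    """Root causes seen on 3+ failures this week, i.e. candidates for a prompt update.
    Each failure is a dict carrying a reviewer-assigned 'root_cause' label."""
    counts = Counter(f["root_cause"] for f in failures)
    return [cause for cause, n in counts.items() if n >= min_count]
```

Root causes that recur week after week without clearing the bar are the "persistent patterns" that the monthly review should turn into engineering tickets.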

Quality check before delivering

  • Sampling strategy includes a numeric sample size
  • Rubric includes examples at each score level
  • Reviewer calibration process is defined
  • Escalation protocol has specific triggers, owners, and timeframes
  • Feedback loop connects findings to prompt changes AND product decisions
  • Safety failures have the most aggressive escalation path

Suggested next step: Before recruiting reviewers, complete the rubric calibration session yourself — score 10 outputs and write down your reasoning for each. This will reveal rubric gaps before a second reviewer finds them.