
AI Acceptance Criteria Generator

Traditional acceptance criteria are binary. AI outputs are probabilistic. This skill writes acceptance criteria that QA, engineering, and stakeholders can actually use — not "it feels right" but "here is what good, mediocre, and bad look like, and here is the threshold we ship at."

Context

One of the most common reasons AI features fail in production is that nobody defined what "good enough" meant before building started. This skill forces that conversation upfront.

Step 1 — Gather the feature context

Ask for the following if not already provided:

  • What does the AI feature do? A one-sentence description of its output
  • Who evaluates the output? End user / internal reviewer / automated system
  • What is the highest-stakes failure? What's the worst thing the AI could produce?
  • Do you have example outputs? Good ones, bad ones, or both — paste them if so
  • What volume of outputs will be produced? Per day / per user session
  • Is there a human review step before the user sees the output?

    Step 2 — Define the three output tiers

    For every AI capability in the feature, write three explicit output examples:

    CAPABILITY: [What the AI is doing — e.g. "summarise a support ticket"]
    
    

    ✅ GOOD OUTPUT

    [Paste or write a real example of an excellent output]

    Why it's good: [what specific qualities make it pass]

    ⚠️ MEDIOCRE OUTPUT

    [Paste or write a borderline-acceptable output]

    Why it's borderline: [what's missing or slightly off]

    Decision: [ship with caveat / needs improvement / acceptable for v1]

    ❌ BAD OUTPUT

    [Paste or write a clearly unacceptable output]

    Why it fails: [specific failure type — see Step 3]

    Write one set of tiers per capability. If the feature has 3 capabilities, there are 3 sets.
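The three tiers also make useful labelled eval data. A minimal sketch of how one set might be recorded; the capability, example text, and field names are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TierExample:
    """One labelled output example for a single AI capability."""
    capability: str   # e.g. "summarise a support ticket"
    tier: str         # "good" | "mediocre" | "bad"
    output: str       # the literal example output
    rationale: str    # why it sits in this tier
    decision: str = ""  # mediocre only: ship with caveat / needs improvement / v1

# One set of tiers for one capability (illustrative content)
tiers = [
    TierExample("summarise a support ticket", "good",
                "Customer reports login failures since the 2.3 release; "
                "affects SSO users only; asks for a rollback timeline.",
                "Captures the issue, its scope, and the customer's ask"),
    TierExample("summarise a support ticket", "mediocre",
                "Customer has login problems and wants help.",
                "Accurate but omits scope and the specific ask",
                decision="needs improvement"),
    TierExample("summarise a support ticket", "bad",
                "Customer is angry about billing errors.",
                "Hallucination: billing is never mentioned in the ticket"),
]

# A complete set covers all three tiers
assert {t.tier for t in tiers} == {"good", "mediocre", "bad"}
```

Keeping the rationale alongside each example means later reviewers can audit why the bar was set where it was, not just where it is.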

    Step 3 — Categorise failure types

    Map every possible bad output to a failure category:

    Failure Type     | Definition                                             | Severity
    Hallucination    | Output contains invented facts not in the input        | Critical
    Omission         | Key information from input is missing from output      | High
    Tone violation   | Output doesn't match the defined product voice         | Medium
    Over-refusal     | System declines a valid, in-scope request              | Medium
    Scope creep      | Output goes beyond what was asked                      | Medium
    Format failure   | Output ignores length, structure, or formatting rules  | Low–Med
    Latency failure  | Response exceeds the acceptable wait time              | High
    Bias / fairness  | Output treats similar inputs differently by group      | Critical
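If an automated pipeline consumes this taxonomy, encoding it directly keeps detection and handling in sync. A minimal sketch; the `caught_by` and `action` values are illustrative placeholders, not recommendations:

```python
# Failure taxonomy from the table above, one handling action per type.
FAILURE_TYPES = {
    "hallucination":   {"severity": "critical", "caught_by": "human review",  "action": "hide"},
    "omission":        {"severity": "high",     "caught_by": "human review",  "action": "flag"},
    "tone_violation":  {"severity": "medium",   "caught_by": "human review",  "action": "flag"},
    "over_refusal":    {"severity": "medium",   "caught_by": "automated log", "action": "retry"},
    "scope_creep":     {"severity": "medium",   "caught_by": "human review",  "action": "flag"},
    "format_failure":  {"severity": "low-med",  "caught_by": "output parser", "action": "retry"},
    "latency_failure": {"severity": "high",     "caught_by": "APM alert",     "action": "fallback"},
    "bias_fairness":   {"severity": "critical", "caught_by": "human review",  "action": "hide"},
}

def handle(failure_type: str) -> str:
    """Return the configured action for a detected failure type."""
    return FAILURE_TYPES[failure_type]["action"]
```

The point of a single table like this is that adding a new failure type forces you to decide its handling at the same time, rather than discovering an unhandled case in production.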

    For each failure type relevant to this feature, define:

  • What it looks like specifically in this product
  • Whether a human or automated check catches it
  • What happens when it's caught (hide / flag / retry / fallback message)

    Step 4 — Set the quality threshold

    Define the minimum bar for each quality dimension before shipping:

    Quality Dimension      | Minimum Threshold | How Measured
    Accuracy / correctness | e.g. >90%         | human eval on N test cases / automated
    Format compliance      | e.g. 100%         | automated output parser
    Hallucination rate     | e.g. <2%          | human review of sample
    Refusal rate           | e.g. <5%          | automated logging
    Latency (p95)          | e.g. <3 seconds   | APM tool
    User satisfaction      | e.g. >4.0/5       | in-product thumbs / survey
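Once thresholds are numeric, the ship decision can be checked mechanically. A minimal sketch using the example values from the table; the metric names and sample measurements are made up for illustration:

```python
# (bar, direction): ">" and ">=" mean higher is better, "<" means lower is better.
THRESHOLDS = {
    "accuracy":           (0.90, ">"),   # human eval on N test cases
    "format_compliance":  (1.00, ">="),  # automated output parser
    "hallucination_rate": (0.02, "<"),   # human review of sample
    "refusal_rate":       (0.05, "<"),   # automated logging
    "latency_p95_s":      (3.0,  "<"),   # APM tool
    "user_satisfaction":  (4.0,  ">"),   # in-product thumbs / survey
}

def ship_gate(measured: dict) -> list[str]:
    """Return the metrics that miss their threshold (empty list = ship)."""
    failures = []
    for metric, (bar, op) in THRESHOLDS.items():
        value = measured[metric]
        ok = {"<": value < bar, ">": value > bar, ">=": value >= bar}[op]
        if not ok:
            failures.append(metric)
    return failures

measured = {"accuracy": 0.93, "format_compliance": 1.0,
            "hallucination_rate": 0.01, "refusal_rate": 0.08,
            "latency_p95_s": 2.4, "user_satisfaction": 4.2}
print(ship_gate(measured))  # → ['refusal_rate']
```

A gate like this is also where a TBD threshold becomes visible: a metric with no numeric bar simply cannot be entered into the table, which is exactly the launch-blocker behaviour the quality check below demands.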

    Step 5 — Write the test case set

    Write a minimum of 8 test cases, each in the following format:

    TEST CASE FORMAT:
    

    ID: TC-[number]

    Type: [Happy path / Edge case / Adversarial / Failure mode]

    Input: [Exact input to give the AI]

    Expected: [What the output should contain, be structured like, or avoid]

    Pass if: [Specific condition]

    Fail if: [Specific condition that triggers a fail]
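The format maps naturally onto a small data structure, with the "Pass if" condition expressed as an executable predicate. A sketch with a hypothetical capability and a hypothetical 50-word length rule:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    id: str        # TC-<number>
    type: str      # Happy path / Edge case / Adversarial / Failure mode
    input: str     # exact input to give the AI
    expected: str  # what the output should contain, be structured like, or avoid
    passes: Callable[[str], bool]  # "Pass if" condition as a predicate

# Illustrative happy-path case for a ticket summariser (rule is hypothetical)
tc1 = TestCase(
    id="TC-1",
    type="Happy path",
    input="Ticket #4812: SSO login fails for all users after the 2.3 upgrade...",
    expected="Summary under 50 words that mentions SSO",
    passes=lambda out: len(out.split()) <= 50 and "SSO" in out,
)
```

Writing the pass condition as code rather than prose surfaces ambiguity early: if you cannot express it as a predicate, QA cannot apply it consistently either.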

    Required test case types (minimum coverage):
  • 2 × Happy path (typical, well-formed inputs)
  • 2 × Edge case (empty input, very long input, unusual format)
  • 2 × Adversarial (inputs designed to confuse, jailbreak, or mislead the AI)
  • 1 × Latency (test response time under expected load)
  • 1 × Failure handling (what happens when the AI produces a bad output)
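The minimum coverage above can be verified automatically before sign-off. A minimal sketch; the suite contents are illustrative:

```python
from collections import Counter

# Required counts from the coverage list above
REQUIRED = {"Happy path": 2, "Edge case": 2, "Adversarial": 2,
            "Latency": 1, "Failure handling": 1}

def coverage_gaps(case_types: list[str]) -> dict:
    """Return required types that are under-represented in the suite."""
    have = Counter(case_types)
    return {t: n - have[t] for t, n in REQUIRED.items() if have[t] < n}

# This suite is one adversarial case short (and only 7 cases total)
suite = ["Happy path", "Happy path", "Edge case", "Edge case",
         "Adversarial", "Latency", "Failure handling"]
print(coverage_gaps(suite))  # → {'Adversarial': 1}
```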

    Step 6 — Write the final acceptance criteria document

    Assemble the outputs of Steps 2–5 into a single document covering: output quality tiers, failure type definitions, quality thresholds, test cases, evaluation method, and sign-off requirements.

    Quality check before delivering

  • Every capability has a good / mediocre / bad example
  • Every failure type has a defined handling action
  • Quality thresholds are numeric, not descriptive
  • At least 8 test cases are written
  • Adversarial test cases are included
  • Any TBD threshold is flagged as a launch blocker

    Suggested next step: Share this with engineering before sprint planning. If any threshold is TBD, schedule a 30-minute threshold-setting session using real output samples before the sprint begins.