
AI Acceptance Criteria Generator

Traditional acceptance criteria are binary. AI outputs are probabilistic. This skill writes acceptance criteria that QA, engineering, and stakeholders can actually use — not "it feels right" but "here is what good, mediocre, and bad look like, and here is the threshold we ship at."

Context

One of the most common reasons AI features fail in production is that nobody defined what "good enough" meant before building started. This skill forces that conversation upfront.

Step 1 — Gather the feature context

Ask for the following if not already provided:

  • What does the AI feature do? A one-sentence description of its output
  • Who evaluates the output? End user / internal reviewer / automated system
  • What is the highest-stakes failure? What's the worst thing the AI could produce?
  • Do you have example outputs? Good ones, bad ones, or both — paste them if so
  • What volume of outputs will be produced? Per day / per user session
  • Is there a human review step before the user sees the output?

    Step 2 — Define the three output tiers

    For every AI capability in the feature, write three explicit output examples:

    CAPABILITY: [What the AI is doing — e.g. "summarise a support ticket"]
    
    

    ✅ GOOD OUTPUT

    [Paste or write a real example of an excellent output]

    Why it's good: [what specific qualities make it pass]

    ⚠️ MEDIOCRE OUTPUT

    [Paste or write a borderline-acceptable output]

    Why it's borderline: [what's missing or slightly off]

    Decision: [ship with caveat / needs improvement / acceptable for v1]

    ❌ BAD OUTPUT

    [Paste or write a clearly unacceptable output]

    Why it fails: [specific failure type — see Step 3]

    Write one set of tiers per capability. If the feature has 3 capabilities, there are 3 sets.
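The three tiers also make useful labelled eval data. A minimal sketch of how one set might be recorded; the capability, example text, and field names are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TierExample:
    """One labelled output example for a single AI capability."""
    capability: str   # e.g. "summarise a support ticket"
    tier: str         # "good" | "mediocre" | "bad"
    output: str       # the literal example output
    rationale: str    # why it sits in this tier
    decision: str = ""  # mediocre only: ship with caveat / needs improvement / v1

# One set of tiers for one capability (illustrative content)
tiers = [
    TierExample("summarise a support ticket", "good",
                "Customer reports login failures since the 2.3 release; "
                "affects SSO users only; asks for a rollback timeline.",
                "Captures the issue, its scope, and the customer's ask"),
    TierExample("summarise a support ticket", "mediocre",
                "Customer has login problems and wants help.",
                "Accurate but omits scope and the specific ask",
                decision="needs improvement"),
    TierExample("summarise a support ticket", "bad",
                "Customer is angry about billing errors.",
                "Hallucination: billing is never mentioned in the ticket"),
]

# A complete set covers all three tiers
assert {t.tier for t in tiers} == {"good", "mediocre", "bad"}
```

Keeping the rationale alongside each example means later reviewers can audit why the bar was set where it was, not just where it is.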

    Step 3 — Categorise failure types

    Map every possible bad output to a failure category:

    Failure Type     | Definition                                             | Severity
    Hallucination    | Output contains invented facts not in the input        | Critical
    Omission         | Key information from input is missing from output      | High
    Tone violation   | Output doesn't match the defined product voice         | Medium
    Over-refusal     | System declines a valid, in-scope request              | Medium
    Scope creep      | Output goes beyond what was asked                      | Medium
    Format failure   | Output ignores length, structure, or formatting rules  | Low–Med
    Latency failure  | Response exceeds the acceptable wait time              | High
    Bias / fairness  | Output treats similar inputs differently by group      | Critical
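If an automated pipeline consumes this taxonomy, encoding it directly keeps detection and handling in sync. A minimal sketch; the `caught_by` and `action` values are illustrative placeholders, not recommendations:

```python
# Failure taxonomy from the table above, one handling action per type.
FAILURE_TYPES = {
    "hallucination":   {"severity": "critical", "caught_by": "human review",  "action": "hide"},
    "omission":        {"severity": "high",     "caught_by": "human review",  "action": "flag"},
    "tone_violation":  {"severity": "medium",   "caught_by": "human review",  "action": "flag"},
    "over_refusal":    {"severity": "medium",   "caught_by": "automated log", "action": "retry"},
    "scope_creep":     {"severity": "medium",   "caught_by": "human review",  "action": "flag"},
    "format_failure":  {"severity": "low-med",  "caught_by": "output parser", "action": "retry"},
    "latency_failure": {"severity": "high",     "caught_by": "APM alert",     "action": "fallback"},
    "bias_fairness":   {"severity": "critical", "caught_by": "human review",  "action": "hide"},
}

def handle(failure_type: str) -> str:
    """Return the configured action for a detected failure type."""
    return FAILURE_TYPES[failure_type]["action"]
```

The point of a single table like this is that adding a new failure type forces you to decide its handling at the same time, rather than discovering an unhandled case in production.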

    For each failure type relevant to this feature, define:

  • What it looks like specifically in this product
  • Whether a human or automated check catches it
  • What happens when it's caught (hide / flag / retry / fallback message)

    Step 4 — Set the quality threshold

    Define the minimum bar for each quality dimension before shipping:

    Quality Dimension      | Minimum Threshold | How Measured
    Accuracy / correctness | e.g. >90%         | human eval on N test cases / automated
    Format compliance      | e.g. 100%         | automated output parser
    Hallucination rate     | e.g. <2%          | human review of sample
    Refusal rate           | e.g. <5%          | automated logging
    Latency (p95)          | e.g. <3 seconds   | APM tool
    User satisfaction      | e.g. >4.0/5       | in-product thumbs / survey
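Once thresholds are numeric, the ship decision can be checked mechanically. A minimal sketch using the example values from the table; the metric names and sample measurements are made up for illustration:

```python
# (bar, direction): ">" and ">=" mean higher is better, "<" means lower is better.
THRESHOLDS = {
    "accuracy":           (0.90, ">"),   # human eval on N test cases
    "format_compliance":  (1.00, ">="),  # automated output parser
    "hallucination_rate": (0.02, "<"),   # human review of sample
    "refusal_rate":       (0.05, "<"),   # automated logging
    "latency_p95_s":      (3.0,  "<"),   # APM tool
    "user_satisfaction":  (4.0,  ">"),   # in-product thumbs / survey
}

def ship_gate(measured: dict) -> list[str]:
    """Return the metrics that miss their threshold (empty list = ship)."""
    failures = []
    for metric, (bar, op) in THRESHOLDS.items():
        value = measured[metric]
        ok = {"<": value < bar, ">": value > bar, ">=": value >= bar}[op]
        if not ok:
            failures.append(metric)
    return failures

measured = {"accuracy": 0.93, "format_compliance": 1.0,
            "hallucination_rate": 0.01, "refusal_rate": 0.08,
            "latency_p95_s": 2.4, "user_satisfaction": 4.2}
print(ship_gate(measured))  # → ['refusal_rate']
```

A gate like this is also where a TBD threshold becomes visible: a metric with no numeric bar simply cannot be entered into the table, which is exactly the launch-blocker behaviour the quality check below demands.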

    Step 5 — Write the test case set

    Write a minimum of 8 test cases, each in the following format:

    TEST CASE FORMAT:
    

    ID: TC-[number]

    Type: [Happy path / Edge case / Adversarial / Failure mode]

    Input: [Exact input to give the AI]

    Expected: [What the output should contain, be structured like, or avoid]

    Pass if: [Specific condition]

    Fail if: [Specific condition that triggers a fail]
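The format maps naturally onto a small data structure, with the "Pass if" condition expressed as an executable predicate. A sketch with a hypothetical capability and a hypothetical 50-word length rule:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    id: str        # TC-<number>
    type: str      # Happy path / Edge case / Adversarial / Failure mode
    input: str     # exact input to give the AI
    expected: str  # what the output should contain, be structured like, or avoid
    passes: Callable[[str], bool]  # "Pass if" condition as a predicate

# Illustrative happy-path case for a ticket summariser (rule is hypothetical)
tc1 = TestCase(
    id="TC-1",
    type="Happy path",
    input="Ticket #4812: SSO login fails for all users after the 2.3 upgrade...",
    expected="Summary under 50 words that mentions SSO",
    passes=lambda out: len(out.split()) <= 50 and "SSO" in out,
)
```

Writing the pass condition as code rather than prose surfaces ambiguity early: if you cannot express it as a predicate, QA cannot apply it consistently either.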

    Required test case types (minimum coverage):
  • 2 × Happy path (typical, well-formed inputs)
  • 2 × Edge case (empty input, very long input, unusual format)
  • 2 × Adversarial (inputs designed to confuse, jailbreak, or mislead the AI)
  • 1 × Latency (test response time under expected load)
  • 1 × Failure handling (what happens when the AI produces a bad output)
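The minimum coverage above can be verified automatically before sign-off. A minimal sketch; the suite contents are illustrative:

```python
from collections import Counter

# Required counts from the coverage list above
REQUIRED = {"Happy path": 2, "Edge case": 2, "Adversarial": 2,
            "Latency": 1, "Failure handling": 1}

def coverage_gaps(case_types: list[str]) -> dict:
    """Return required types that are under-represented in the suite."""
    have = Counter(case_types)
    return {t: n - have[t] for t, n in REQUIRED.items() if have[t] < n}

# This suite is one adversarial case short (and only 7 cases total)
suite = ["Happy path", "Happy path", "Edge case", "Edge case",
         "Adversarial", "Latency", "Failure handling"]
print(coverage_gaps(suite))  # → {'Adversarial': 1}
```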

    Step 6 — Write the final acceptance criteria document

    Assemble the outputs of Steps 2–5 into a single document covering: output quality tiers, failure type definitions, quality thresholds, test cases, evaluation method, and sign-off requirements.

    Quality check before delivering

  • Every capability has a good / mediocre / bad example
  • Every failure type has a defined handling action
  • Quality thresholds are numeric, not descriptive
  • At least 8 test cases are written
  • Adversarial test cases are included
  • Any TBD threshold is flagged as a launch blocker

    Suggested next step: Share this with engineering before sprint planning. If any threshold is TBD, schedule a 30-minute threshold-setting session using real output samples before the sprint begins.