AI Acceptance Criteria Generator
Traditional acceptance criteria are binary. AI outputs are probabilistic. This skill writes acceptance criteria that QA, engineering, and stakeholders can actually use — not "it feels right" but "here is what good, mediocre, and bad look like, and here is the threshold we ship at."
Context
The single most common reason AI features fail in production is that nobody defined what "good enough" meant before building started. This skill forces that conversation upfront.
Step 1 — Gather the feature context
Ask for the following if not already provided:
Step 2 — Define the three output tiers
For every AI capability in the feature, write three explicit output examples:
CAPABILITY: [What the AI is doing — e.g. "summarise a support ticket"]
✅ GOOD OUTPUT
[Paste or write a real example of an excellent output]
Why it's good: [what specific qualities make it pass]
⚠️ MEDIOCRE OUTPUT
[Paste or write a borderline-acceptable output]
Why it's borderline: [what's missing or slightly off]
Decision: [ship with caveat / needs improvement / acceptable for v1]
❌ BAD OUTPUT
[Paste or write a clearly unacceptable output]
Why it fails: [specific failure type — see Step 3]
Write one set of tiers per capability. If the feature has 3 capabilities, there are 3 sets.
Step 3 — Categorise failure types
Map every possible bad output to a failure category:
| Failure Type | Definition | Severity |
|---|---|---|
| Hallucination | Output contains invented facts not in the input | Critical |
| Omission | Key information from input is missing from output | High |
| Tone violation | Output doesn't match the defined product voice | Medium |
| Over-refusal | System declines a valid, in-scope request | Medium |
| Scope creep | Output goes beyond what was asked | Medium |
| Format failure | Output ignores length, structure, or formatting rules | Low–Med |
| Latency failure | Response exceeds the acceptable wait time | High |
| Bias / fairness | Output treats similar inputs differently by group | Critical |
For each failure type relevant to this feature, define:
Step 4 — Set the quality threshold
Define the minimum bar for each quality dimension before shipping:
| Quality Dimension | Minimum Threshold | How Measured |
|---|---|---|
| Accuracy / correctness | e.g. >90% | human eval on N test cases / automated |
| Format compliance | e.g. 100% | automated output parser |
| Hallucination rate | e.g. <2% | human review of sample |
| Refusal rate | e.g. <5% | automated logging |
| Latency (p95) | e.g. <3 seconds | APM tool |
| User satisfaction | e.g. >4.0/5 | in-product thumbs / survey |
Step 5 — Write the test case set
Write a minimum of 8 test cases covering:
TEST CASE FORMAT:
ID: TC-[number]
Type: [Happy path / Edge case / Adversarial / Failure mode]
Input: [Exact input to give the AI]
Expected: [What the output should contain, be structured like, or avoid]
Pass if: [Specific condition]
Fail if: [Specific condition that triggers a fail]
Required test case types (minimum coverage):
Step 6 — Write the final acceptance criteria document
Assemble into the template including: Output quality tiers, Failure type definitions, Quality thresholds, Test cases, Evaluation method, and Sign-off requirements.