AI Quality Metrics
Most teams ship AI features with no quality measurement in place. They track latency and uptime — infrastructure metrics — while having no visibility into whether the AI is doing its job well. This skill defines a quality metric system that connects model behaviour to user outcomes, and gives engineering and product a shared language for "is it working?"
---
Context
The three layers of AI quality (all must be measured):

| Layer | What it measures | Who cares |
|---|---|---|
| Output quality | Is the AI's response good? | PM, users |
| System reliability | Is the AI available and fast? | Engineering |
| User impact | Is the AI changing user behaviour in the intended direction? | PM, business |
Most teams only measure layer 2. A complete quality system measures all three.
The quality metric trap: Optimising output quality metrics without tracking user impact metrics produces AI that scores well on rubrics but fails in product. Always pair output quality metrics with downstream user metrics.

---
Step 1 — Define the feature context
Ask:
What does the feature do, and what user action follows the AI output?
What is the stakes level (High / Medium / Low) if the output is wrong?
Is there a ground truth to compare against, or is quality judged by rubric?
---
Step 2 — Select the output quality metrics
From this list, select the metrics that apply to the feature. Not all features need all metrics.
OUTPUT QUALITY METRICS
Accuracy
Definition: The proportion of outputs that match the correct answer when a ground truth exists.
When to use: Classification tasks, extraction tasks, Q&A with verifiable answers.
How to measure: Compare AI output to labelled ground truth. Report as %.
Target range: ≥ 90% for High stakes; ≥ 85% for Medium; ≥ 75% for Low.
Precision and Recall
Definition: Precision = of what the AI flagged, how much was actually correct. Recall = of
what was correct, how much did the AI catch?
When to use: Binary classification, content moderation, entity extraction.
How to measure: Confusion matrix on a labelled test set.
Target range: Set based on cost of false positives vs. false negatives for this feature.
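Precision and recall fall out of the confusion matrix directly. A minimal sketch (the function name and boolean-label representation are illustrative, not from the original):

```python
from typing import Iterable

def precision_recall(predictions: Iterable[bool], labels: Iterable[bool]) -> tuple[float, float]:
    """Precision and recall for a binary classifier, from paired predictions and ground truth."""
    tp = fp = fn = 0
    for pred, truth in zip(predictions, labels):
        if pred and truth:
            tp += 1        # flagged and actually correct
        elif pred and not truth:
            fp += 1        # flagged but wrong (hurts precision)
        elif not pred and truth:
            fn += 1        # missed (hurts recall)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For example, `precision_recall([True, True, False, True], [True, False, True, True])` gives precision 2/3 and recall 2/3: one false flag and one miss.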
Groundedness rate
Definition: % of outputs where every factual claim is traceable to the provided source.
When to use: RAG features, summarisation with source documents, factual Q&A.
How to measure: Human spot-check or automated citation verification.
Target range: ≥ 95% for High stakes features.
Format compliance rate
Definition: % of outputs that match the required output format exactly.
When to use: Any feature with structured output requirements.
How to measure: Automated parser — pass/fail.
Target range: 100%. Format failures are never acceptable at scale.
Refusal accuracy
Definition: % of refusals that were correct (true out-of-scope) vs. false refusals (incorrectly
declined valid requests).
When to use: Any feature with scope limits or safety filters.
How to measure: Human review of a sampled set of refusals.
Target range: False refusal rate ≤ 3%.
Coherence score
Definition: Does the output make logical sense end-to-end? (No contradictions, non-sequiturs,
or incomplete reasoning.)
When to use: Long-form generation, reasoning tasks, multi-step outputs.
How to measure: Human rubric (1–5 scale). See aipm-hitl-eval for rubric design.
Target range: ≥ 4.0 average.
---
Step 3 — Select the system reliability metrics
SYSTEM RELIABILITY METRICS
Latency (p50, p95, p99)
Definition: Time from request to first token / full response. Track all three percentiles —
p50 tells you the typical experience; p99 tells you the worst 1%.
Target range: Define by feature type — real-time features: p95 ≤ 2s; async features: p95 ≤ 10s.
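In production these percentiles come from your monitoring system's histograms; as a sketch of what p50 vs. p99 mean, here is a nearest-rank computation over made-up latency samples:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at or above pct% of sorted samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [220, 310, 180, 2400, 290, 260, 5100, 240, 275, 300]  # illustrative request latencies
p50 = percentile(latencies_ms, 50)  # typical experience: 275ms
p95 = percentile(latencies_ms, 95)  # tail experience: 5100ms
p99 = percentile(latencies_ms, 99)
```

Note how one slow request dominates the tail: p50 looks healthy while p95 breaches a 2s real-time target, which is exactly why tracking only averages hides problems.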
Error rate
Definition: % of requests that return an error (model error, timeout, rate limit, malformed output).
Target range: ≤ 0.5% for customer-facing features.
Availability
Definition: % of time the AI feature is functional and accepting requests.
Target range: ≥ 99.5% for customer-facing; ≥ 99.0% for internal tools.
Token cost per output
Definition: Average token spend per successful output. Tracks cost efficiency.
When to use: Any feature at meaningful scale (>1k requests/day).
Target range: Set a budget ceiling and alert when exceeded.
Context window utilisation
Definition: % of the context window used on average per request.
When to use: Features that include large documents or long conversation history.
Target range: Alert when average exceeds 70% — approaching the window limit degrades quality.
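The utilisation alert is a one-line computation over per-request prompt token counts. A sketch, with a hypothetical 128k-token window standing in for your model's actual limit:

```python
WINDOW_LIMIT = 128_000   # hypothetical model context window, in tokens
ALERT_THRESHOLD = 0.70   # from the target range above

def window_utilisation(prompt_tokens: list[int]) -> float:
    """Average fraction of the context window consumed per request."""
    if not prompt_tokens:
        return 0.0
    return sum(prompt_tokens) / len(prompt_tokens) / WINDOW_LIMIT

def should_alert(prompt_tokens: list[int]) -> bool:
    return window_utilisation(prompt_tokens) > ALERT_THRESHOLD
```

The same per-request token counts feed the token-cost metric, so one instrumentation point serves both.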
---
Step 4 — Select the user impact metrics
USER IMPACT METRICS
Task completion rate
Definition: % of users who complete the intended action after receiving AI output.
Example: For an AI that writes email drafts — % of drafts sent without major edit.
How to measure: Product analytics on the action that follows AI output use.
Output acceptance rate
Definition: % of AI outputs the user accepts vs. edits or discards.
When to use: Any copilot or draft-generation feature.
How to measure: Track accept/edit/discard events in the product.
Target range: > 60% acceptance rate is a healthy signal. Below 40% indicates quality problems.
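Once accept/edit/discard events are instrumented, the rate itself is trivial to compute. A minimal sketch (event names are illustrative):

```python
from collections import Counter

def acceptance_rate(events: list[str]) -> float:
    """events: one of 'accept', 'edit', 'discard' per AI output shown to a user."""
    counts = Counter(events)
    total = sum(counts.values())
    return counts["accept"] / total if total else 0.0
```

The hard part is not the arithmetic but emitting the three events consistently in the product, which is why this metric needs engineering time before launch.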
User satisfaction (CSAT on AI outputs)
Definition: User rating of the AI output quality, collected in-product.
How to measure: Thumbs up / thumbs down on individual outputs, or a 1–5 star rating.
When to use: Any customer-facing AI feature.
Target range: > 80% thumbs-up rate.
Re-query rate
Definition: % of users who ask the same question or retry the same task after receiving
an AI output — indicating the output didn't satisfy them.
How to measure: Session analysis — same user, same intent, within 5 minutes.
Target range: < 15%. Above 25% is a quality signal to investigate.
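The session analysis above can be sketched as a pass over query events. This assumes a crude intent proxy — the normalised query string — where a real system would cluster intents semantically:

```python
from datetime import datetime, timedelta

FIVE_MINUTES = timedelta(minutes=5)

def re_query_rate(events: list[tuple[str, str, datetime]]) -> float:
    """events: (user_id, query_text, timestamp). A re-query is the same user issuing
    the same normalised query within 5 minutes of a previous one."""
    if not events:
        return 0.0
    last_seen: dict[tuple[str, str], datetime] = {}
    re_queries = 0
    for user, query, ts in sorted(events, key=lambda e: e[2]):
        key = (user, query.strip().lower())  # crude intent proxy
        prev = last_seen.get(key)
        if prev is not None and ts - prev <= FIVE_MINUTES:
            re_queries += 1
        last_seen[key] = ts
    return re_queries / len(events)
```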
AI-assisted feature engagement
Definition: Are users who engage with the AI feature more retained or activated than
those who don't?
How to measure: Segment users by AI feature engagement; compare retention/activation.
Why this matters: Proves the AI feature has product value, not just usage.
---
Step 5 — Build the metrics dashboard specification
Define the metrics, targets, owners, and reporting cadence in a single table:
AI QUALITY METRICS DASHBOARD: [Feature name]
As of: [date] Stakes level: [level] Owner: [PM name]
LAYER 1 — OUTPUT QUALITY
| Metric | Target | Method | Cadence | Owner |
|---|---|---|---|---|
| [Metric 1] | [threshold] | [auto/human] | [weekly] | [PM] |
| [Metric 2] | [threshold] | [auto/human] | [weekly] | [Eng] |
LAYER 2 — SYSTEM RELIABILITY
| Metric | Target | Method | Cadence | Owner |
|---|---|---|---|---|
| Latency p95 | ≤ [N]ms | Automated | Real-time | Eng |
| Error rate | ≤ [N]% | Automated | Real-time | Eng |
| Availability | ≥ [N]% | Automated | Real-time | Eng |
LAYER 3 — USER IMPACT
| Metric | Target | Method | Cadence | Owner |
|---|---|---|---|---|
| [Metric 1] | [threshold] | Analytics | Weekly | PM |
| [Metric 2] | [threshold] | Analytics | Weekly | PM |
ALERT THRESHOLDS (immediate action required):
Error rate > [X]% → page on-call engineer
Output acceptance rate drops > 10% week-over-week → PM review within 24 hours
Latency p95 exceeds [N]ms for > 30 mins → incident declared
CSAT drops below [X]% → PM + engineering review same day
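Alert rules like these can live as data rather than scattered conditionals, so adding a rule never touches the evaluation loop. A sketch with hypothetical thresholds standing in for the bracketed placeholders above:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertRule:
    name: str
    breached: Callable[[dict], bool]  # inspects the current metric snapshot
    action: str

RULES = [
    AlertRule("error_rate", lambda m: m["error_rate"] > 0.005, "page on-call engineer"),
    AlertRule("acceptance_drop", lambda m: m["acceptance_wow_delta"] < -0.10, "PM review within 24h"),
    AlertRule("latency_p95", lambda m: m["latency_p95_ms"] > 2000, "declare incident"),
]

def fired_alerts(metrics: dict) -> list[str]:
    """Return the action line for every rule the current snapshot breaches."""
    return [f"{r.name}: {r.action}" for r in RULES if r.breached(metrics)]
```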
---
Step 6 — Define the metric review process
METRIC REVIEW PROCESS
Daily (automated, no meeting):
Error rate and availability: automated alerts only
Latency p95: automated alerts if threshold breached
Weekly (PM reviews):
Output quality metrics: review sample-based eval results
User impact metrics: pull from analytics
Flag any metric that moved > 10% week-over-week
Write a 3-bullet weekly quality summary: [improving / stable / degrading] per layer
Monthly (PM + engineering sync):
Trend analysis: is each metric trending better, stable, or worse over 4 weeks?
Metric review: are the targets still the right targets?
Backlog prioritisation: which quality issues are now worth a sprint?
METRIC HEALTH SCORING:
Green: Within 10% of target, stable or improving
Amber: 10–25% off target, or trending worse for 2+ weeks
Red: > 25% off target, or a sudden drop > 20% in any week
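The Green/Amber/Red rules above are mechanical enough to score automatically. A sketch, assuming targets are expressed so that higher is better and `weekly_drop` is the worst single-week relative decline observed (e.g. 0.22 for a 22% drop):

```python
def health(metric: float, target: float,
           trending_worse_weeks: int = 0, weekly_drop: float = 0.0) -> str:
    """Score a higher-is-better metric Green / Amber / Red per the rules above."""
    gap = max((target - metric) / target, 0.0)  # relative shortfall vs. target
    if gap > 0.25 or weekly_drop > 0.20:
        return "Red"    # far off target, or a sudden >20% drop
    if gap > 0.10 or trending_worse_weeks >= 2:
        return "Amber"  # 10-25% off, or trending worse for 2+ weeks
    return "Green"      # within 10% of target, stable or improving
```

For example, an 85% acceptance rate against a 90% target scores Green, 70% scores Amber, and 60% (or any metric that just dropped a quarter in one week) scores Red.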
---
Step 7 — Output the AI quality metrics document
# AI Quality Metrics: [Feature name]
Date: [date] Stakes level: [level] Owner: [PM name]
---
Selected metrics
Output quality
[Selected metrics with definitions, targets, and measurement method]
System reliability
[Selected metrics with targets]
User impact
[Selected metrics with targets]
Dashboard specification
[Full table from Step 5]
Alert thresholds
[List from Step 5]
Review process
[Cadence and ownership from Step 6]
Metric health legend
Green / Amber / Red definitions
Known gaps
[Metrics you want but can't measure yet — with what it would take to add them]
---
Quality check before delivering
All three layers have at least one metric selected.
Every metric has a target, measurement method, cadence, and owner.
Alert thresholds are defined with a named responder and response time.
User impact metrics have analytics events identified, or are listed under Known gaps.
---
Output
Deliver the complete AI quality metrics document.
Then add:
> Suggested next step: Before building dashboards, instrument the user impact metrics first. Output quality can be measured on demand with eval runs. User impact metrics require product analytics events — these take engineering time to add and are often deprioritised. Get them in the next sprint before the feature ships.