
AI Quality Metrics

Most teams ship AI features with no quality measurement in place. They track latency and uptime — infrastructure metrics — while having no visibility into whether the AI is doing its job well. This skill defines a quality metric system that connects model behaviour to user outcomes, and gives engineering and product a shared language for "is it working?"

---

Context

The three layers of AI quality (all must be measured):
Layer | What it measures | Who cares
Output quality | Is the AI's response good? | PM, users
System reliability | Is the AI available and fast? | Engineering
User impact | Is the AI changing user behaviour in the intended direction? | PM, business

Most teams only measure layer 2. A complete quality system measures all three.

The quality metric trap: Optimising output quality metrics without tracking user impact metrics produces AI that scores well on rubrics but fails in product. Always pair output quality metrics with downstream user metrics.

---

Step 1 — Define the feature context

Ask:

  • What does the AI feature do? (One sentence)
  • What is the primary output type? (Classification / Free text / Structured data / Score / Decision)
  • Who uses the output? (End user directly / Internal team / Downstream system)
  • What action does a user take based on the AI output? (This is the user impact anchor)
  • What does a failure look like from the user's perspective? (Not from the model's perspective)
---

    Step 2 — Select the output quality metrics

    From this list, select the metrics that apply to the feature. Not all features need all metrics.

    OUTPUT QUALITY METRICS
    
    

    Accuracy

    Definition: The proportion of outputs that match the correct answer when a ground truth exists.

    When to use: Classification tasks, extraction tasks, Q&A with verifiable answers.

    How to measure: Compare AI output to labelled ground truth. Report as %.

    Target range: ≥ 90% for High stakes; ≥ 85% for Medium; ≥ 75% for Low.

    Precision and Recall

    Definition: Precision = of what the AI flagged, how much was actually correct. Recall = of what was correct, how much did the AI catch?

    When to use: Binary classification, content moderation, entity extraction.

    How to measure: Confusion matrix on a labelled test set.

    Target range: Set based on cost of false positives vs. false negatives for this feature.
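Both metrics fall out of the confusion matrix directly. A minimal sketch in Python, assuming a binary task where the labelled test set is a list of (predicted, actual) boolean pairs:

```python
# Sketch: accuracy, precision, and recall from a labelled test set.
# Each item is a (predicted, actual) pair of booleans for a binary task.

def classification_metrics(pairs):
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

labelled = [(True, True), (True, False), (False, True), (False, False)]
metrics = classification_metrics(labelled)  # all three are 0.5 here
```

Run this on the same labelled set every time the prompt or model changes, so movements in precision and recall are comparable across versions.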

    Groundedness rate

    Definition: % of outputs where every factual claim is traceable to the provided source.

    When to use: RAG features, summarisation with source documents, factual Q&A.

    How to measure: Human spot-check or automated citation verification.

    Target range: ≥ 95% for High stakes features.

    Format compliance rate

    Definition: % of outputs that match the required output format exactly.

    When to use: Any feature with structured output requirements.

    How to measure: Automated parser — pass/fail.

    Target range: 100%. Format failures are never acceptable at scale.
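This check can be fully automated on every request. A minimal sketch, assuming the feature must return a JSON object; the required keys here are illustrative, not from the source:

```python
import json

# Sketch: automated pass/fail format check for a feature that must
# return a JSON object with specific keys (key names are illustrative).
REQUIRED_KEYS = {"summary", "confidence"}

def is_format_compliant(raw_output: str) -> bool:
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()

outputs = ['{"summary": "ok", "confidence": 0.9}', 'not json at all']
compliance_rate = sum(map(is_format_compliant, outputs)) / len(outputs)
```

Because it is pass/fail and cheap, this is the one output quality metric that can run on 100% of traffic rather than a sample.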

    Refusal accuracy

    Definition: % of refusals that were correct (true out-of-scope) vs. false refusals (incorrectly declined valid requests).

    When to use: Any feature with scope limits or safety filters.

    How to measure: Sample-based review of all refusals.

    Target range: False refusal rate ≤ 3%.

    Coherence score

    Definition: Does the output make logical sense end-to-end? (No contradictions, non-sequiturs, or incomplete reasoning.)

    When to use: Long-form generation, reasoning tasks, multi-step outputs.

    How to measure: Human rubric (1–5 scale). See aipm-hitl-eval for rubric design.

    Target range: ≥ 4.0 average.

    ---

    Step 3 — Select the system reliability metrics

    SYSTEM RELIABILITY METRICS
    
    

    Latency (p50, p95, p99)

    Definition: Time from request to first token / full response. Track all three percentiles — p50 tells you the typical experience; p99 tells you the worst 1%.

    Target range: Define by feature type — real-time features: p95 ≤ 2s; async features: p95 ≤ 10s.
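In production these percentiles come from the metrics backend, but the definition is worth making concrete. A sketch using the nearest-rank method on raw latency samples (values in ms are made up):

```python
import math

# Sketch: p50/p95/p99 from raw latency samples (ms), nearest-rank method.
def percentile(samples, p):
    ordered = sorted(samples)
    # smallest value with at least p% of samples at or below it
    rank = max(math.ceil(len(ordered) * p / 100), 1)
    return ordered[rank - 1]

latencies = [120, 150, 160, 170, 180, 190, 200, 210, 340, 2200]
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
# p50 is the typical experience; p95/p99 expose the slow tail that
# averages hide (here one 2200ms outlier dominates both)
```

Note how the mean of this sample would look acceptable while p95 reveals a tail problem, which is exactly why all three percentiles are tracked.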

    Error rate

    Definition: % of requests that return an error (model error, timeout, rate limit, malformed output).

    Target range: ≤ 0.5% for customer-facing features.

    Availability

    Definition: % of time the AI feature is functional and accepting requests.

    Target range: ≥ 99.5% for customer-facing; ≥ 99.0% for internal tools.

    Token cost per output

    Definition: Average token spend per successful output. Tracks cost efficiency.

    When to use: Any feature at meaningful scale (>1k requests/day).

    Target range: Set a budget ceiling and alert when exceeded.

    Context window utilisation

    Definition: % of the context window used on average per request.

    When to use: Features that include large documents or long conversation history.

    Target range: Alert when average exceeds 70% — approaching window limit impacts quality.

    ---

    Step 4 — Select the user impact metrics

    USER IMPACT METRICS
    
    

    Task completion rate

    Definition: % of users who complete the intended action after receiving AI output.

    Example: For an AI that writes email drafts — % of drafts sent without major edit.

    How to measure: Product analytics on the action that follows AI output use.

    Output acceptance rate

    Definition: % of AI outputs the user accepts vs. edits or discards.

    When to use: Any copilot or draft-generation feature.

    How to measure: Track accept/edit/discard events in the product.

    Target range: > 60% acceptance rate is a healthy signal. Below 40% indicates quality problems.
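Given accept/edit/discard events from product analytics, the rate is a simple ratio. A sketch; the event names are assumptions to be mapped onto your own analytics schema:

```python
from collections import Counter

# Sketch: output acceptance rate from accept/edit/discard events.
# Event names ("accept", "edit", "discard") are illustrative.
def acceptance_rate(events):
    counts = Counter(events)
    total = sum(counts.values())
    return counts["accept"] / total if total else 0.0

week_events = ["accept", "edit", "accept", "discard", "accept"]
rate = acceptance_rate(week_events)  # 0.6, above the healthy threshold
```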

    User satisfaction (CSAT on AI outputs)

    Definition: User rating of the AI output quality, collected in-product.

    How to measure: Thumbs up / thumbs down on individual outputs, or a 1–5 star rating.

    When to use: Any customer-facing AI feature.

    Target range: > 80% thumbs-up rate.

    Re-query rate

    Definition: % of users who ask the same question or retry the same task after receiving an AI output — indicating the output didn't satisfy them.

    How to measure: Session analysis — same user, same intent, within 5 minutes.

    Target range: < 15%. Above 25% is a quality signal to investigate.
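The session analysis is a single pass over time-ordered events. A sketch, where the `user`/`intent`/`ts` field names are assumptions standing in for your analytics event schema:

```python
from datetime import datetime, timedelta

# Sketch: re-query detection (same user, same intent, within 5 minutes).
WINDOW = timedelta(minutes=5)

def requery_rate(events):
    if not events:
        return 0.0
    last_seen = {}  # (user, intent) -> timestamp of the previous query
    requeries = 0
    for e in sorted(events, key=lambda e: e["ts"]):
        key = (e["user"], e["intent"])
        prev = last_seen.get(key)
        if prev is not None and e["ts"] - prev <= WINDOW:
            requeries += 1
        last_seen[key] = e["ts"]
    return requeries / len(events)

events = [
    {"user": "a", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 0)},
    {"user": "a", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 3)},
    {"user": "b", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 4)},
    {"user": "a", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 20)},
]
rate = requery_rate(events)  # 0.25: one of four queries was a retry
```

The hard part in practice is the "same intent" grouping; this sketch assumes intents are already labelled, whereas real pipelines often need query similarity to cluster them.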

    AI-assisted feature engagement

    Definition: Are users who engage with the AI feature more retained or activated than those who don't?

    How to measure: Segment users by AI feature engagement; compare retention/activation.

    Why this matters: Proves the AI feature has product value, not just usage.

    ---

    Step 5 — Build the metrics dashboard specification

    Define the metrics, targets, owners, and reporting cadence in a single table:

    AI QUALITY METRICS DASHBOARD: [Feature name]
    

    As of: [date] Stakes level: [level] Owner: [PM name]

    LAYER 1 — OUTPUT QUALITY

    Metric | Target | Method | Cadence | Owner

    [Metric 1] | [threshold] | [auto/human] | [weekly] | [PM]

    [Metric 2] | [threshold] | [auto/human] | [weekly] | [Eng]

    LAYER 2 — SYSTEM RELIABILITY

    Metric | Target | Method | Cadence | Owner

    Latency p95 | ≤ [N]ms | Automated | Real-time | Eng

    Error rate | ≤ [N]% | Automated | Real-time | Eng

    Availability | ≥ [N]% | Automated | Real-time | Eng

    LAYER 3 — USER IMPACT

    Metric | Target | Method | Cadence | Owner

    [Metric 1] | [threshold] | Analytics | Weekly | PM

    [Metric 2] | [threshold] | Analytics | Weekly | PM

    ALERT THRESHOLDS (immediate action required):

  • Error rate > [X]% → page on-call engineer
  • Output acceptance rate drops > 10% week-over-week → PM review within 24 hours
  • Latency p95 exceeds [N]ms for > 30 mins → incident declared
  • CSAT drops below [X]% → PM + engineering review same day
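Alert rules like the above can be wired into a simple checker that runs against the latest readings. A sketch; the threshold values, metric names, and actions are illustrative placeholders for the bracketed values in the table:

```python
# Sketch: evaluate alert thresholds against current metric readings.
# Thresholds, metric names, and actions are illustrative.
ALERTS = [
    ("error_rate", lambda m: m["error_rate"] > 0.5, "page on-call engineer"),
    ("acceptance_wow_drop", lambda m: m["acceptance_wow_drop"] > 10,
     "PM review within 24 hours"),
    ("latency_p95_ms", lambda m: m["latency_p95_ms"] > 2000,
     "declare incident if sustained over 30 mins"),
]

def fired_alerts(metrics):
    return [(name, action) for name, check, action in ALERTS if check(metrics)]

readings = {"error_rate": 0.8, "acceptance_wow_drop": 4, "latency_p95_ms": 1400}
fired = fired_alerts(readings)  # only the error-rate alert fires
```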
---

    Step 6 — Define the metric review process

    METRIC REVIEW PROCESS
    
    

    Daily (automated, no meeting):

  • Error rate and availability: automated alerts only
  • Latency p95: automated alerts if threshold breached

    Weekly (PM reviews):

  • Output quality metrics: review sample-based eval results
  • User impact metrics: pull from analytics
  • Flag any metric that moved > 10% week-over-week
  • Write a 3-bullet weekly quality summary: [improving / stable / degrading] per layer

    Monthly (PM + engineering sync):

  • Trend analysis: is each metric trending better, stable, or worse over 4 weeks?
  • Metric review: are the targets still the right targets?
  • Backlog prioritisation: which quality issues are now worth a sprint?

    METRIC HEALTH SCORING:

    Green: Within 10% of target, stable or improving

    Amber: 10–25% off target, or trending worse for 2+ weeks

    Red: > 25% off target, or a sudden drop > 20% in any week
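The legend translates directly into a scoring function. A sketch, assuming off-target distance and the weekly drop are already expressed as percentages by your reporting pipeline:

```python
# Sketch: green/amber/red metric health from the legend above.
# Inputs are percentages computed upstream by the reporting pipeline.
def health(pct_off_target, weeks_worsening=0, weekly_drop_pct=0.0):
    if pct_off_target > 25 or weekly_drop_pct > 20:
        return "red"
    if pct_off_target > 10 or weeks_worsening >= 2:
        return "amber"
    return "green"

statuses = [health(5), health(15), health(8, weeks_worsening=3), health(30)]
# ["green", "amber", "amber", "red"]
```

Encoding the legend as a function keeps the weekly summary consistent: two PMs looking at the same numbers always produce the same colour.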

    ---

    Step 7 — Output the AI quality metrics document

    # AI Quality Metrics: [Feature name]
    Date: [date]  Stakes level: [level]  Owner: [PM name]
    
    

    ---

    Selected metrics

    Output quality

    [Selected metrics with definitions, targets, and measurement method]

    System reliability

    [Selected metrics with targets]

    User impact

    [Selected metrics with targets]

    Dashboard specification

    [Full table from Step 5]

    Alert thresholds

    [List from Step 5]

    Review process

    [Cadence and ownership from Step 6]

    Metric health legend

    Green / Amber / Red definitions

    Known gaps

    [Metrics you want but can't measure yet — with what it would take to add them]

    ---

    Quality check before delivering

  • All three layers are represented — not just reliability metrics
  • Every metric has a defined target — not "we'll see what's reasonable"
  • User impact metrics are included — not just output quality
  • Alert thresholds are defined for the most critical metrics
  • Measurement method is specified — automated vs. human vs. analytics
  • Review cadence assigns an owner to each layer

    ---

    Output

    Deliver the complete AI quality metrics document.

    Then add:

    > Suggested next step: Before building dashboards, instrument the user impact metrics first. Output quality can be measured on demand with eval runs. User impact metrics require product analytics events — these take engineering time to add and are often deprioritised. Get them in the next sprint before the feature ships.