
AI Quality Metrics

Most teams ship AI features with no quality measurement in place. They track latency and uptime — infrastructure metrics — while having no visibility into whether the AI is doing its job well. This skill defines a quality metric system that connects model behaviour to user outcomes, and gives engineering and product a shared language for "is it working?"

---

Context

The three layers of AI quality (all must be measured):
Layer | What it measures | Who cares
Output quality | Is the AI's response good? | PM, users
System reliability | Is the AI available and fast? | Engineering
User impact | Is the AI changing user behaviour in the intended direction? | PM, business

Most teams only measure layer 2. A complete quality system measures all three.

The quality metric trap: Optimising output quality metrics without tracking user impact metrics produces AI that scores well on rubrics but fails in product. Always pair output quality metrics with downstream user metrics.

---

Step 1 — Define the feature context

Ask:

  • What does the AI feature do? (One sentence)
  • What is the primary output type? (Classification / Free text / Structured data / Score / Decision)
  • Who uses the output? (End user directly / Internal team / Downstream system)
  • What action does a user take based on the AI output? (This is the user impact anchor)
  • What does a failure look like from the user's perspective? (Not from the model's perspective)
---

    Step 2 — Select the output quality metrics

    From this list, select the metrics that apply to the feature. Not all features need all metrics.

    OUTPUT QUALITY METRICS
    
    

    Accuracy

    Definition: The proportion of outputs that match the correct answer when a ground truth exists.

    When to use: Classification tasks, extraction tasks, Q&A with verifiable answers.

    How to measure: Compare AI output to labelled ground truth. Report as %.

    Target range: ≥ 90% for High stakes; ≥ 85% for Medium; ≥ 75% for Low.

    Precision and Recall

    Definition: Precision = of what the AI flagged, how much was actually correct. Recall = of what was correct, how much did the AI catch?

    When to use: Binary classification, content moderation, entity extraction.

    How to measure: Confusion matrix on a labelled test set.

    Target range: Set based on cost of false positives vs. false negatives for this feature.
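Both metrics fall out of the confusion matrix directly. A minimal sketch in Python, assuming a binary task where the labelled test set is a list of (predicted, actual) boolean pairs:

```python
# Sketch: accuracy, precision, and recall from a labelled test set.
# Each item is a (predicted, actual) pair of booleans for a binary task.

def classification_metrics(pairs):
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

labelled = [(True, True), (True, False), (False, True), (False, False)]
metrics = classification_metrics(labelled)  # all three are 0.5 here
```

Run this on the same labelled set every time the prompt or model changes, so movements in precision and recall are comparable across versions.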

    Groundedness rate

    Definition: % of outputs where every factual claim is traceable to the provided source.

    When to use: RAG features, summarisation with source documents, factual Q&A.

    How to measure: Human spot-check or automated citation verification.

    Target range: ≥ 95% for High stakes features.

    Format compliance rate

    Definition: % of outputs that match the required output format exactly.

    When to use: Any feature with structured output requirements.

    How to measure: Automated parser — pass/fail.

    Target range: 100%. Format failures are never acceptable at scale.
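This check can be fully automated on every request. A minimal sketch, assuming the feature must return a JSON object; the required keys here are illustrative, not from the source:

```python
import json

# Sketch: automated pass/fail format check for a feature that must
# return a JSON object with specific keys (key names are illustrative).
REQUIRED_KEYS = {"summary", "confidence"}

def is_format_compliant(raw_output: str) -> bool:
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()

outputs = ['{"summary": "ok", "confidence": 0.9}', 'not json at all']
compliance_rate = sum(map(is_format_compliant, outputs)) / len(outputs)
```

Because it is pass/fail and cheap, this is the one output quality metric that can run on 100% of traffic rather than a sample.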

    Refusal accuracy

    Definition: % of refusals that were correct (true out-of-scope) vs. false refusals (incorrectly declined valid requests).

    When to use: Any feature with scope limits or safety filters.

    How to measure: Sample-based review of all refusals.

    Target range: False refusal rate ≤ 3%.

    Coherence score

    Definition: Does the output make logical sense end-to-end? (No contradictions, non-sequiturs, or incomplete reasoning.)

    When to use: Long-form generation, reasoning tasks, multi-step outputs.

    How to measure: Human rubric (1–5 scale). See aipm-hitl-eval for rubric design.

    Target range: ≥ 4.0 average.

    ---

    Step 3 — Select the system reliability metrics

    SYSTEM RELIABILITY METRICS
    
    

    Latency (p50, p95, p99)

    Definition: Time from request to first token / full response. Track all three percentiles — p50 tells you the typical experience; p99 tells you the worst 1%.

    Target range: Define by feature type — real-time features: p95 ≤ 2s; async features: p95 ≤ 10s.
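In production these percentiles come from the metrics backend, but the definition is worth making concrete. A sketch using the nearest-rank method on raw latency samples (values in ms are made up):

```python
import math

# Sketch: p50/p95/p99 from raw latency samples (ms), nearest-rank method.
def percentile(samples, p):
    ordered = sorted(samples)
    # smallest value with at least p% of samples at or below it
    rank = max(math.ceil(len(ordered) * p / 100), 1)
    return ordered[rank - 1]

latencies = [120, 150, 160, 170, 180, 190, 200, 210, 340, 2200]
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
# p50 is the typical experience; p95/p99 expose the slow tail that
# averages hide (here one 2200ms outlier dominates both)
```

Note how the mean of this sample would look acceptable while p95 reveals a tail problem, which is exactly why all three percentiles are tracked.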

    Error rate

    Definition: % of requests that return an error (model error, timeout, rate limit, malformed output).

    Target range: ≤ 0.5% for customer-facing features.

    Availability

    Definition: % of time the AI feature is functional and accepting requests.

    Target range: ≥ 99.5% for customer-facing; ≥ 99.0% for internal tools.

    Token cost per output

    Definition: Average token spend per successful output. Tracks cost efficiency.

    When to use: Any feature at meaningful scale (>1k requests/day).

    Target range: Set a budget ceiling and alert when exceeded.

    Context window utilisation

    Definition: % of the context window used on average per request.

    When to use: Features that include large documents or long conversation history.

    Target range: Alert when average exceeds 70% — approaching window limit impacts quality.

    ---

    Step 4 — Select the user impact metrics

    USER IMPACT METRICS
    
    

    Task completion rate

    Definition: % of users who complete the intended action after receiving AI output.

    Example: For an AI that writes email drafts — % of drafts sent without major edit.

    How to measure: Product analytics on the action that follows AI output use.

    Output acceptance rate

    Definition: % of AI outputs the user accepts vs. edits or discards.

    When to use: Any copilot or draft-generation feature.

    How to measure: Track accept/edit/discard events in the product.

    Target range: > 60% acceptance rate is a healthy signal. Below 40% indicates quality problems.
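Given accept/edit/discard events from product analytics, the rate is a simple ratio. A sketch; the event names are assumptions to be mapped onto your own analytics schema:

```python
from collections import Counter

# Sketch: output acceptance rate from accept/edit/discard events.
# Event names ("accept", "edit", "discard") are illustrative.
def acceptance_rate(events):
    counts = Counter(events)
    total = sum(counts.values())
    return counts["accept"] / total if total else 0.0

week_events = ["accept", "edit", "accept", "discard", "accept"]
rate = acceptance_rate(week_events)  # 0.6, above the healthy threshold
```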

    User satisfaction (CSAT on AI outputs)

    Definition: User rating of the AI output quality, collected in-product.

    How to measure: Thumbs up / thumbs down on individual outputs, or a 1–5 star rating.

    When to use: Any customer-facing AI feature.

    Target range: > 80% thumbs-up rate.

    Re-query rate

    Definition: % of users who ask the same question or retry the same task after receiving an AI output — indicating the output didn't satisfy them.

    How to measure: Session analysis — same user, same intent, within 5 minutes.

    Target range: < 15%. Above 25% is a quality signal to investigate.
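The session analysis is a single pass over time-ordered events. A sketch, where the `user`/`intent`/`ts` field names are assumptions standing in for your analytics event schema:

```python
from datetime import datetime, timedelta

# Sketch: re-query detection (same user, same intent, within 5 minutes).
WINDOW = timedelta(minutes=5)

def requery_rate(events):
    if not events:
        return 0.0
    last_seen = {}  # (user, intent) -> timestamp of the previous query
    requeries = 0
    for e in sorted(events, key=lambda e: e["ts"]):
        key = (e["user"], e["intent"])
        prev = last_seen.get(key)
        if prev is not None and e["ts"] - prev <= WINDOW:
            requeries += 1
        last_seen[key] = e["ts"]
    return requeries / len(events)

events = [
    {"user": "a", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 0)},
    {"user": "a", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 3)},
    {"user": "b", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 4)},
    {"user": "a", "intent": "summarise", "ts": datetime(2024, 1, 1, 10, 20)},
]
rate = requery_rate(events)  # 0.25: one of four queries was a retry
```

The hard part in practice is the "same intent" grouping; this sketch assumes intents are already labelled, whereas real pipelines often need query similarity to cluster them.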

    AI-assisted feature engagement

    Definition: Are users who engage with the AI feature more retained or activated than those who don't?

    How to measure: Segment users by AI feature engagement; compare retention/activation.

    Why this matters: Proves the AI feature has product value, not just usage.

    ---

    Step 5 — Build the metrics dashboard specification

    Define the metrics, targets, owners, and reporting cadence in a single table:

    AI QUALITY METRICS DASHBOARD: [Feature name]
    

    As of: [date] Stakes level: [level] Owner: [PM name]

    LAYER 1 — OUTPUT QUALITY

    Metric | Target | Method | Cadence | Owner

    [Metric 1] | [threshold] | [auto/human] | [weekly] | [PM]

    [Metric 2] | [threshold] | [auto/human] | [weekly] | [Eng]

    LAYER 2 — SYSTEM RELIABILITY

    Metric | Target | Method | Cadence | Owner

    Latency p95 | ≤ [N]ms | Automated | Real-time | Eng

    Error rate | ≤ [N]% | Automated | Real-time | Eng

    Availability | ≥ [N]% | Automated | Real-time | Eng

    LAYER 3 — USER IMPACT

    Metric | Target | Method | Cadence | Owner

    [Metric 1] | [threshold] | Analytics | Weekly | PM

    [Metric 2] | [threshold] | Analytics | Weekly | PM

    ALERT THRESHOLDS (immediate action required):

  • Error rate > [X]% → page on-call engineer
  • Output acceptance rate drops > 10% week-over-week → PM review within 24 hours
  • Latency p95 exceeds [N]ms for > 30 mins → incident declared
  • CSAT drops below [X]% → PM + engineering review same day
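Alert rules like the above can be wired into a simple checker that runs against the latest readings. A sketch; the threshold values, metric names, and actions are illustrative placeholders for the bracketed values in the table:

```python
# Sketch: evaluate alert thresholds against current metric readings.
# Thresholds, metric names, and actions are illustrative.
ALERTS = [
    ("error_rate", lambda m: m["error_rate"] > 0.5, "page on-call engineer"),
    ("acceptance_wow_drop", lambda m: m["acceptance_wow_drop"] > 10,
     "PM review within 24 hours"),
    ("latency_p95_ms", lambda m: m["latency_p95_ms"] > 2000,
     "declare incident if sustained over 30 mins"),
]

def fired_alerts(metrics):
    return [(name, action) for name, check, action in ALERTS if check(metrics)]

readings = {"error_rate": 0.8, "acceptance_wow_drop": 4, "latency_p95_ms": 1400}
fired = fired_alerts(readings)  # only the error-rate alert fires
```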
---

    Step 6 — Define the metric review process

    METRIC REVIEW PROCESS
    
    

    Daily (automated, no meeting):

  • Error rate and availability: automated alerts only
  • Latency p95: automated alerts if threshold breached

    Weekly (PM reviews):

  • Output quality metrics: review sample-based eval results
  • User impact metrics: pull from analytics
  • Flag any metric that moved > 10% week-over-week
  • Write a 3-bullet weekly quality summary: [improving / stable / degrading] per layer

    Monthly (PM + engineering sync):

  • Trend analysis: is each metric trending better, stable, or worse over 4 weeks?
  • Metric review: are the targets still the right targets?
  • Backlog prioritisation: which quality issues are now worth a sprint?

    METRIC HEALTH SCORING:

    Green: Within 10% of target, stable or improving

    Amber: 10–25% off target, or trending worse for 2+ weeks

    Red: > 25% off target, or a sudden drop > 20% in any week
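The legend translates directly into a scoring function. A sketch, assuming off-target distance and the weekly drop are already expressed as percentages by your reporting pipeline:

```python
# Sketch: green/amber/red metric health from the legend above.
# Inputs are percentages computed upstream by the reporting pipeline.
def health(pct_off_target, weeks_worsening=0, weekly_drop_pct=0.0):
    if pct_off_target > 25 or weekly_drop_pct > 20:
        return "red"
    if pct_off_target > 10 or weeks_worsening >= 2:
        return "amber"
    return "green"

statuses = [health(5), health(15), health(8, weeks_worsening=3), health(30)]
# ["green", "amber", "amber", "red"]
```

Encoding the legend as a function keeps the weekly summary consistent: two PMs looking at the same numbers always produce the same colour.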

    ---

    Step 7 — Output the AI quality metrics document

    # AI Quality Metrics: [Feature name]
    Date: [date]  Stakes level: [level]  Owner: [PM name]
    
    

    ---

    Selected metrics

    Output quality

    [Selected metrics with definitions, targets, and measurement method]

    System reliability

    [Selected metrics with targets]

    User impact

    [Selected metrics with targets]

    Dashboard specification

    [Full table from Step 5]

    Alert thresholds

    [List from Step 5]

    Review process

    [Cadence and ownership from Step 6]

    Metric health legend

    Green / Amber / Red definitions

    Known gaps

    [Metrics you want but can't measure yet — with what it would take to add them]

    ---

    Quality check before delivering

  • All three layers are represented — not just reliability metrics
  • Every metric has a defined target — not "we'll see what's reasonable"
  • User impact metrics are included — not just output quality
  • Alert thresholds are defined for the most critical metrics
  • Measurement method is specified — automated vs. human vs. analytics
  • Review cadence assigns an owner to each layer

    ---

    Output

    Deliver the complete AI quality metrics document.

    Then add:

    > Suggested next step: Before building dashboards, instrument the user impact metrics first. Output quality can be measured on demand with eval runs. User impact metrics require product analytics events — these take engineering time to add and are often deprioritised. Get them in the next sprint before the feature ships.