Confidence Scoring
An AI that always sounds certain is a liability. Users who can't tell the difference between a confident accurate output and a confident hallucination will eventually trust the AI when they shouldn't — and the consequences scale with the stakes. Confidence scoring is the PM mechanism for giving users the information they need to decide when to act on AI output and when to verify it. This skill designs the scoring system, the UX, and the logic that drives both.
---
Context
What confidence scoring is not: it is not simply exposing the model's raw probability scores. Softmax probabilities are poorly calibrated and meaningless to non-technical users. Confidence scoring is a product design decision, not a pass-through of model internals.
The three signals that drive confidence:
| Signal | How it works | Best for |
|---|---|---|
| Retrieval confidence | How relevant were the retrieved source documents? (RAG features) | Q&A, document search, knowledge bases |
| Self-assessed uncertainty | Prompt the model to flag its own uncertainty | General-purpose AI, free-text generation |
| Downstream validation | A second model call checks the output for errors | High-stakes factual features |
---
Step 1 — Define the confidence scoring context
Ask: What are the stakes if a user acts on a wrong output? Can users verify the output themselves, and how easily? Is the feature retrieval-backed (RAG) or free-text generation? The answers determine which signal applies and how aggressive the confidence UX needs to be.
Step 2 — Choose the confidence scoring approach
Approach A — Retrieval-based confidence (for RAG features)
Score relevance using cosine similarity thresholds: High ≥ 0.85, Medium 0.70–0.84, Low < 0.70.
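The threshold mapping above can be sketched as a small function. Note that the specific cutoffs (0.85 / 0.70) are illustrative and embedding-model-specific; they should be tuned empirically for the product's own corpus.

```python
def retrieval_confidence(similarity: float) -> str:
    """Map a cosine-similarity score for the best retrieved document
    to a confidence band, using the thresholds above:
    High >= 0.85, Medium 0.70-0.84, Low < 0.70."""
    if similarity >= 0.85:
        return "HIGH"
    if similarity >= 0.70:
        return "MEDIUM"
    return "LOW"
```

In practice this would be fed the top result's similarity from the retriever; some teams instead aggregate over the top-k results, which is a design choice to test during calibration.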
Approach B — Self-assessed uncertainty (for general features)
Prompt the model to evaluate its own confidence as HIGH/MEDIUM/LOW with reasoning.
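A minimal sketch of the self-assessment pattern: append an instruction asking the model to label its own confidence, then parse the label out of the reply. The prompt wording and the `CONFIDENCE:` convention here are assumptions, not a prescribed template; the key design decision is defaulting to LOW when the label is missing, so absent signals fail safe.

```python
import re

# Assumed instruction appended to the task prompt (illustrative wording).
SELF_ASSESSMENT_SUFFIX = (
    "\n\nAfter your answer, on a new line write 'CONFIDENCE: HIGH', "
    "'CONFIDENCE: MEDIUM', or 'CONFIDENCE: LOW', with a one-sentence reason."
)

def parse_self_assessment(model_output: str) -> tuple[str, str]:
    """Split a model reply into (answer, confidence label).
    Defaults to LOW when the model omits the label."""
    match = re.search(r"CONFIDENCE:\s*(HIGH|MEDIUM|LOW)",
                      model_output, re.IGNORECASE)
    if not match:
        return model_output.strip(), "LOW"
    answer = model_output[: match.start()].strip()
    return answer, match.group(1).upper()
```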
Approach C — Downstream validation scoring (for Critical/High stakes)
A second model call checks for factual errors. Adds latency and cost.
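The validation pass can be sketched as a second call that grades the first answer against its sources. `call_model` is a placeholder for whatever LLM client the product uses, and the PASS/FAIL protocol is an assumption; the deliberate choice shown is that any verdict other than an explicit PASS downgrades confidence.

```python
def validate_output(answer: str, sources: list[str], call_model) -> str:
    """Downstream validation sketch: a second model call fact-checks
    the answer. Any non-PASS verdict fails safe to LOW confidence."""
    prompt = (
        "You are a fact-checker. Given the sources and the answer, "
        "reply with exactly PASS or FAIL.\n\n"
        "Sources:\n" + "\n".join(sources) +
        "\n\nAnswer:\n" + answer
    )
    verdict = call_model(prompt).strip().upper()
    return "HIGH" if verdict == "PASS" else "LOW"
```

Because this doubles model calls per request, it is usually gated to the Critical/High-stakes features named above rather than applied everywhere.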
Step 3 — Design the confidence UX
Four patterns: colour-coded signal, inline disclaimer, source citation display, or refusal threshold.
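The four patterns can be combined in one confidence-to-UX mapping. The field names and copy below are illustrative, not a real design spec; the structural point is that LOW confidence triggers the refusal threshold rather than a weaker badge.

```python
def render_confidence(confidence: str, citations: list[str]) -> dict:
    """Map a confidence band to UX decisions: refusal threshold for LOW,
    colour-coded badge plus citations otherwise, and an inline
    disclaimer for MEDIUM."""
    if confidence == "LOW":
        # Refusal threshold: below the bar, don't show the answer at all.
        return {
            "show_answer": False,
            "message": ("I'm not confident enough to answer this. "
                        "Please check the original sources."),
        }
    ui = {
        "show_answer": True,
        "badge_colour": "green" if confidence == "HIGH" else "amber",
        "citations": citations,  # source citation display
    }
    if confidence == "MEDIUM":
        ui["disclaimer"] = "This answer may be incomplete — verify the sources."
    return ui
```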
UX anti-patterns to avoid: showing raw percentage scores, using "I think" phrasing without a systematic signal, and burying confidence information in a settings page where users never see it.
Step 4 — Define the calibration process
Collect at least 100 representative outputs, have humans rate each one for accuracy, and build a calibration table by confidence band. A well-calibrated system: HIGH ≥ 90% accurate, MEDIUM 60–89%, LOW < 60%. Recalibrate whenever the model or prompt changes, and when the input distribution shifts.
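The calibration check above can be sketched as a small aggregation over human-rated outputs: compute accuracy per confidence band and flag bands that miss the stated targets (HIGH ≥ 90%, MEDIUM 60–89%, LOW < 60%).

```python
from collections import defaultdict

def calibration_table(rated_outputs):
    """Build a calibration table from (confidence_label, was_accurate)
    pairs and flag whether each band meets its target range."""
    counts = defaultdict(lambda: [0, 0])  # label -> [accurate, total]
    for label, accurate in rated_outputs:
        counts[label][1] += 1
        if accurate:
            counts[label][0] += 1
    table = {}
    for label, (hits, total) in counts.items():
        rate = hits / total
        if label == "HIGH":
            ok = rate >= 0.90
        elif label == "MEDIUM":
            ok = 0.60 <= rate < 0.90
        else:  # LOW
            ok = rate < 0.60
        table[label] = {"accuracy": rate, "calibrated": ok}
    return table
```

Run this on each new batch of rated outputs; a band falling out of its range is the trigger to adjust thresholds or prompts.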