A/B Testing AI Features
A/B testing an AI feature is not the same as A/B testing a button colour. AI outputs are non-deterministic — the same user might get a different output on two visits to the same variant. This skill designs statistically sound experiments specifically for AI features.
Context
How AI A/B testing differs:
| Standard A/B test | AI A/B test |
|---|---|
| Variant is deterministic | Variant is probabilistic |
| Primary metric is click/conversion | Primary metric may be output quality or trust |
| Straightforward sample size | Larger samples needed due to variance |
| Winning variant is deployed permanently | Winning prompt/model needs version control |
Three types of AI experiments:
| Type | What it tests | Example |
|---|---|---|
| Prompt experiment | Does prompt A produce better outputs than prompt B? | Zero-shot vs. few-shot |
| Model experiment | Does model A produce better outcomes than model B? | GPT-4o vs. Claude Sonnet |
| Feature experiment | Does the AI feature improve user outcomes vs. no AI? | With AI summary vs. without |
Step 1 — Define the experiment
EXPERIMENT DEFINITION:
Hypothesis: We believe that [change] will cause [outcome] because [reason].
Type: [Prompt / Model / Feature experiment]
Control (A): [Description]
Variant (B): [Description]
Primary metric: [e.g. "Output acceptance rate"]
Guardrail metrics: [e.g. "Latency p95 must not increase > 500ms"]
Minimum detectable effect: [e.g. "+5 percentage points"]
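The definition template above is easier to version-control if captured as a structured record. A minimal sketch, assuming a Python codebase; the class and field names are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentDefinition:
    """Pre-registered definition of an AI A/B experiment."""
    hypothesis: str                 # "We believe [change] will cause [outcome] because [reason]"
    experiment_type: str            # "prompt" | "model" | "feature"
    control: str
    variant: str
    primary_metric: str
    guardrail_metrics: tuple        # e.g. ("latency_p95_ms increase <= 500",)
    minimum_detectable_effect: float  # absolute, e.g. 0.05 = +5 percentage points

# Example: a prompt experiment registered before launch
exp = ExperimentDefinition(
    hypothesis="Few-shot examples will raise acceptance because outputs match house style",
    experiment_type="prompt",
    control="Zero-shot summary prompt",
    variant="Few-shot summary prompt (3 examples)",
    primary_metric="output_acceptance_rate",
    guardrail_metrics=("latency_p95_ms increase <= 500",),
    minimum_detectable_effect=0.05,
)
```

Freezing the dataclass makes the definition immutable once registered, which matches the pre-registration discipline the template implies.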
Step 2 — Calculate sample size
Apply an AI variance multiplier: because outputs vary even within a single variant, inflate the standard sample-size estimate (a 1.5–2× multiplier is a reasonable starting point).
Minimum experiment duration: 7 days, so results average over day-of-week variation. Maximum: 28 days, to limit drift in user behaviour and model performance.
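The sample-size step can be sketched with the standard two-proportion formula, inflated by the variance multiplier. This uses only the standard library; the 1.5 default multiplier is an assumed starting point, not an established constant:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, mde, alpha=0.05, power=0.80,
                            ai_variance_multiplier=1.5):
    """Users needed per variant to detect an absolute lift of `mde`
    over `baseline_rate`, inflated for AI output variance."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline_rate, baseline_rate + mde
    p_bar = (p1 + p2) / 2
    # Classic two-proportion sample-size formula
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n * ai_variance_multiplier)
```

For a 30% baseline acceptance rate and a +5 pp MDE, the uninflated requirement is roughly 1,400 users per variant; the multiplier raises it to roughly 2,100.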
Step 3 — Design the setup
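Because the same variant can produce different outputs on different calls, assignment must at least be sticky per user: the same user always sees the same variant. A minimal sketch using deterministic hash bucketing (function and variant names are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")):
    """Deterministic bucketing: the same user + experiment pair always
    maps to the same variant, independent of server or session."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Including the experiment name in the hash input de-correlates assignments across experiments, so a user bucketed into "treatment" in one test is not systematically bucketed into "treatment" in every test.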
Step 4 — Define the analysis plan
Decision criterion:
p < 0.05 AND observed effect ≥ MDE → consider shipping
p < 0.05 but observed effect < MDE → statistically significant, not practically meaningful
p ≥ 0.05 → do not ship; do not extend the experiment past its planned duration hoping for significance
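The decision criterion can be made mechanical with a two-proportion z-test. A sketch under the assumption that the primary metric is a binary acceptance rate; the decision strings are illustrative:

```python
from statistics import NormalDist

def two_proportion_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions.
    Returns (observed_effect, p_value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value

def decide(effect, p_value, mde, alpha=0.05):
    """Apply the pre-registered decision criterion."""
    if p_value >= alpha:
        return "no significant difference"
    return "consider shipping" if effect >= mde else "significant but below MDE"
```

For example, 300/1000 acceptances in control versus 360/1000 in the variant gives a +6 pp effect at p ≈ 0.004, which clears a +5 pp MDE.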
Run segment analysis: An AI variant can win on average while losing for a high-value segment.
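The segment check can be automated by breaking the primary metric out per segment and flagging any segment where the variant trails the control, even when the overall average wins. A sketch with illustrative segment and variant labels:

```python
from collections import defaultdict

def segment_breakdown(rows):
    """rows: iterable of (segment, variant, accepted) tuples.
    Returns {(segment, variant): acceptance_rate}."""
    counts = defaultdict(lambda: [0, 0])   # accepted, total
    for segment, variant, accepted in rows:
        counts[(segment, variant)][0] += int(accepted)
        counts[(segment, variant)][1] += 1
    return {key: acc / total for key, (acc, total) in counts.items()}

def losing_segments(rates, control="A", variant="B"):
    """Segments where the variant's rate falls below the control's."""
    segments = {seg for seg, _ in rates}
    return sorted(
        seg for seg in segments
        if rates.get((seg, variant), 0.0) < rates.get((seg, control), 0.0)
    )
```

A non-empty result from `losing_segments` is a reason to investigate before shipping, even with a significant overall win.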