A/B Testing AI Features

A/B testing an AI feature is not the same as A/B testing a button colour. AI outputs are non-deterministic — the same user might get a different output on two visits to the same variant. This skill designs statistically sound experiments specifically for AI features.

Context

How AI A/B testing differs:

| Standard A/B test | AI A/B test |
|---|---|
| Variant is deterministic | Variant is probabilistic |
| Primary metric is click/conversion | Primary metric may be output quality or trust |
| Straightforward sample size | Larger samples needed due to variance |
| Winning variant is deployed permanently | Winning prompt/model needs version control |

Three types of AI experiments:

| Type | What it tests | Example |
|---|---|---|
| Prompt experiment | Does prompt A produce better outputs than prompt B? | Zero-shot vs. few-shot |
| Model experiment | Does model A produce better outcomes than model B? | GPT-4o vs. Claude Sonnet |
| Feature experiment | Does the AI feature improve user outcomes vs. no AI? | With AI summary vs. without |

Step 1 — Define the experiment

EXPERIMENT DEFINITION:

Hypothesis: We believe that [change] will cause [outcome] because [reason].

Type: [Prompt / Model / Feature experiment]

Control (A): [Description]

Variant (B): [Description]

Primary metric: [e.g. "Output acceptance rate"]

Guardrail metrics: [e.g. "Latency p95 must not increase > 500ms"]

Minimum detectable effect: [e.g. "+5 percentage points"]
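The template above can be pinned down as a small data structure so every experiment is defined the same way before it runs. A minimal Python sketch; the class and field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentDefinition:
    """One record per experiment, written before any data is collected."""
    hypothesis: str                   # "We believe that X will cause Y because Z"
    experiment_type: str              # "prompt" | "model" | "feature"
    control: str
    variant: str
    primary_metric: str               # e.g. "output acceptance rate"
    guardrail_metrics: tuple          # e.g. ("latency p95 increase <= 500ms",)
    minimum_detectable_effect: float  # 0.05 means +5 percentage points

# Hypothetical example experiment
summary_exp = ExperimentDefinition(
    hypothesis="Few-shot examples will raise summary acceptance "
               "because they anchor tone and length",
    experiment_type="prompt",
    control="Zero-shot summarisation prompt",
    variant="Few-shot prompt with 3 examples",
    primary_metric="output acceptance rate",
    guardrail_metrics=("latency p95 increase <= 500ms",),
    minimum_detectable_effect=0.05,
)
```

Freezing the dataclass makes the definition immutable, which reinforces the rule that criteria are fixed before results are seen.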

Step 2 — Calculate sample size

Start from a standard power calculation for your primary metric, then apply an AI variance multiplier:

  • Free-text output quality metrics: multiply by 1.5–2x
  • Binary acceptance metrics: standard sample size applies
  • CSAT ratings (1–5 scale): multiply by 1.3x
  • Minimum experiment duration: 7 days (day-of-week variation). Maximum: 28 days.
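As an illustration, the calculation can be sketched as a standard two-proportion z-test sample size with the multiplier applied on top. The multiplier values are the heuristics listed above, not exact statistics:

```python
import math
from statistics import NormalDist

# Heuristic AI variance multipliers from the list above
MULTIPLIER = {"binary": 1.0, "csat": 1.3, "free_text": 2.0}

def sample_size_per_arm(p_control, mde, metric_type="binary",
                        alpha=0.05, power=0.80):
    """Two-proportion z-test sample size, inflated by the AI
    variance multiplier for the metric type."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_power = NormalDist().inv_cdf(power)           # ~0.84
    p_variant = p_control + mde
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    n = (z_alpha + z_power) ** 2 * variance / mde ** 2
    return math.ceil(n * MULTIPLIER[metric_type])

# 30% baseline acceptance, +5pp MDE: a free-text quality metric
# needs roughly double the sample of a binary acceptance metric.
print(sample_size_per_arm(0.30, 0.05, "binary"))
print(sample_size_per_arm(0.30, 0.05, "free_text"))
```

Use the resulting per-arm number together with your traffic estimate to check the experiment fits inside the 7–28 day window.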

Step 3 — Design the setup

  • Assignment unit: User ID (not session ID)
  • Prompt/model isolation: Lock versions during the experiment
  • Contamination prevention: Exclude users who saw both variants
  • Novelty effect mitigation: Exclude first 3 days from primary analysis
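Assignment by user ID can be made deterministic by hashing, so the same user lands in the same variant on every visit regardless of session. A minimal sketch; the function name and ID formats are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "variant")) -> str:
    """Hash user ID + experiment ID so assignment is stable across
    sessions and independent between experiments."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user, same experiment: always the same bucket
assert assign_variant("user-42", "summary-prompt-exp") == \
       assign_variant("user-42", "summary-prompt-exp")
```

Salting the hash with the experiment ID prevents the same users from always co-occurring in the same arm across experiments.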
Step 4 — Define the analysis plan

Decision criterion:

  • p < 0.05 AND observed effect ≥ MDE → consider shipping
  • p < 0.05 with effect below MDE → significant but not practical

Run segment analysis: an AI variant can win on average while losing for a high-value segment.
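The two criteria can be made mechanical so they are fixed before anyone looks at the data. A sketch using a two-sided two-proportion z-test, assuming the primary metric is a binary acceptance rate:

```python
import math
from statistics import NormalDist

def evaluate(accept_a, n_a, accept_b, n_b, mde, alpha=0.05):
    """Applies the pre-registered decision criterion to acceptance counts."""
    rate_a, rate_b = accept_a / n_a, accept_b / n_b
    pooled = (accept_a + accept_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(rate_b - rate_a) / se))
    effect = rate_b - rate_a
    if p_value < alpha and effect >= mde:
        return "consider shipping"
    if p_value < alpha:
        return "significant but not practical"
    return "inconclusive"
```

The same function can be re-run per segment to catch a variant that wins on average but loses for a high-value segment.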

Step 5 — Define the decision framework

  • SHIP when all analysis checklist items pass and the PM has reviewed output samples
  • ITERATE when the effect is in the right direction but below the MDE
  • INVESTIGATE when a guardrail metric is violated
  • EXTEND when the target sample size has not been reached
  • AI-specific rule: Before shipping, PM must manually review 50 random outputs from the winning variant.
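The rules above can be ordered into a single decision function. A sketch; the ordering (guardrail and sample-size checks before significance) and the negative-effect fallback label are assumptions:

```python
def decide(p_value, effect, mde, guardrails_ok, sample_size_reached,
           pm_review_passed, alpha=0.05):
    """Maps experiment results onto SHIP / ITERATE / INVESTIGATE / EXTEND."""
    if not guardrails_ok:
        return "INVESTIGATE"   # a guardrail metric was violated
    if not sample_size_reached:
        return "EXTEND"        # keep running, up to the 28-day cap
    if p_value < alpha and effect >= mde:
        # AI-specific rule: significance alone is not enough to ship;
        # the manual review of 50 outputs must also have passed
        return "SHIP" if pm_review_passed else "INVESTIGATE"
    if effect > 0:
        return "ITERATE"       # right direction, below MDE
    return "DO NOT SHIP"       # assumed fallback for a null or negative effect
```

Encoding the framework this way forces every branch to be agreed before results arrive, which is the point of Step 5.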

Quality check before delivering

  • Hypothesis is testable — not "improve the AI"
  • Assignment unit is user ID — not session ID
  • Sample size includes the AI variance multiplier
  • Guardrail metrics defined before the experiment runs
  • PM manual output review required before shipping

Suggested next step: Define the decision criteria before looking at results — not after.