A/B Testing AI Features

A/B testing an AI feature is not the same as A/B testing a button colour. AI outputs are non-deterministic — the same user might get a different output on two visits to the same variant. This skill designs statistically sound experiments specifically for AI features.

Context

How AI A/B testing differs:

| Standard A/B test | AI A/B test |
|---|---|
| Variant is deterministic | Variant is probabilistic |
| Primary metric is click/conversion | Primary metric may be output quality or trust |
| Straightforward sample size | Larger samples needed due to variance |
| Winning variant is deployed permanently | Winning prompt/model needs version control |

Three types of AI experiments:

| Type | What it tests | Example |
|---|---|---|
| Prompt experiment | Does prompt A produce better outputs than prompt B? | Zero-shot vs. few-shot |
| Model experiment | Does model A produce better outcomes than model B? | GPT-4o vs. Claude Sonnet |
| Feature experiment | Does the AI feature improve user outcomes vs. no AI? | With AI summary vs. without |

Step 1 — Define the experiment

EXPERIMENT DEFINITION:

Hypothesis: We believe that [change] will cause [outcome] because [reason].

Type: [Prompt / Model / Feature experiment]

Control (A): [Description]

Variant (B): [Description]

Primary metric: [e.g. "Output acceptance rate"]

Guardrail metrics: [e.g. "Latency p95 must not increase > 500ms"]

Minimum detectable effect: [e.g. "+5 percentage points"]
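The template above can be pinned down as a small data structure so every experiment is defined the same way before it runs. A minimal Python sketch; the class and field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentDefinition:
    """One record per experiment, written before any data is collected."""
    hypothesis: str                   # "We believe that X will cause Y because Z"
    experiment_type: str              # "prompt" | "model" | "feature"
    control: str
    variant: str
    primary_metric: str               # e.g. "output acceptance rate"
    guardrail_metrics: tuple          # e.g. ("latency p95 increase <= 500ms",)
    minimum_detectable_effect: float  # 0.05 means +5 percentage points

# Hypothetical example experiment
summary_exp = ExperimentDefinition(
    hypothesis="Few-shot examples will raise summary acceptance "
               "because they anchor tone and length",
    experiment_type="prompt",
    control="Zero-shot summarisation prompt",
    variant="Few-shot prompt with 3 examples",
    primary_metric="output acceptance rate",
    guardrail_metrics=("latency p95 increase <= 500ms",),
    minimum_detectable_effect=0.05,
)
```

Freezing the dataclass makes the definition immutable, which reinforces the rule that criteria are fixed before results are seen.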

Step 2 — Calculate sample size

Start from a standard power calculation for your primary metric, then apply an AI variance multiplier:

  • Free-text output quality metrics: multiply by 1.5–2x
  • Binary acceptance metrics: standard sample size applies
  • CSAT ratings (1–5 scale): multiply by 1.3x
  • Minimum experiment duration: 7 days (day-of-week variation). Maximum: 28 days.
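As an illustration, the calculation can be sketched as a standard two-proportion z-test sample size with the multiplier applied on top. The multiplier values are the heuristics listed above, not exact statistics:

```python
import math
from statistics import NormalDist

# Heuristic AI variance multipliers from the list above
MULTIPLIER = {"binary": 1.0, "csat": 1.3, "free_text": 2.0}

def sample_size_per_arm(p_control, mde, metric_type="binary",
                        alpha=0.05, power=0.80):
    """Two-proportion z-test sample size, inflated by the AI
    variance multiplier for the metric type."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_power = NormalDist().inv_cdf(power)           # ~0.84
    p_variant = p_control + mde
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    n = (z_alpha + z_power) ** 2 * variance / mde ** 2
    return math.ceil(n * MULTIPLIER[metric_type])

# 30% baseline acceptance, +5pp MDE: a free-text quality metric
# needs roughly double the sample of a binary acceptance metric.
print(sample_size_per_arm(0.30, 0.05, "binary"))
print(sample_size_per_arm(0.30, 0.05, "free_text"))
```

Use the resulting per-arm number together with your traffic estimate to check the experiment fits inside the 7–28 day window.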

Step 3 — Design the setup

  • Assignment unit: User ID (not session ID)
  • Prompt/model isolation: Lock versions during the experiment
  • Contamination prevention: Exclude users who saw both variants
  • Novelty effect mitigation: Exclude first 3 days from primary analysis
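Assignment by user ID can be made deterministic by hashing, so the same user lands in the same variant on every visit regardless of session. A minimal sketch; the function name and ID formats are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "variant")) -> str:
    """Hash user ID + experiment ID so assignment is stable across
    sessions and independent between experiments."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user, same experiment: always the same bucket
assert assign_variant("user-42", "summary-prompt-exp") == \
       assign_variant("user-42", "summary-prompt-exp")
```

Salting the hash with the experiment ID prevents the same users from always co-occurring in the same arm across experiments.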
Step 4 — Define the analysis plan

Decision criterion:

  • p < 0.05 AND observed effect ≥ MDE → consider shipping
  • p < 0.05 with effect below MDE → significant but not practical

Run segment analysis: an AI variant can win on average while losing for a high-value segment.
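The two criteria can be made mechanical so they are fixed before anyone looks at the data. A sketch using a two-sided two-proportion z-test, assuming the primary metric is a binary acceptance rate:

```python
import math
from statistics import NormalDist

def evaluate(accept_a, n_a, accept_b, n_b, mde, alpha=0.05):
    """Applies the pre-registered decision criterion to acceptance counts."""
    rate_a, rate_b = accept_a / n_a, accept_b / n_b
    pooled = (accept_a + accept_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(rate_b - rate_a) / se))
    effect = rate_b - rate_a
    if p_value < alpha and effect >= mde:
        return "consider shipping"
    if p_value < alpha:
        return "significant but not practical"
    return "inconclusive"
```

The same function can be re-run per segment to catch a variant that wins on average but loses for a high-value segment.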

Step 5 — Define the decision framework

  • SHIP when all analysis checklist items pass and the PM has reviewed output samples
  • ITERATE when the effect is in the right direction but below the MDE
  • INVESTIGATE when a guardrail metric is violated
  • EXTEND when the target sample size has not been reached
  • AI-specific rule: Before shipping, PM must manually review 50 random outputs from the winning variant.
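The rules above can be ordered into a single decision function. A sketch; the ordering (guardrail and sample-size checks before significance) and the negative-effect fallback label are assumptions:

```python
def decide(p_value, effect, mde, guardrails_ok, sample_size_reached,
           pm_review_passed, alpha=0.05):
    """Maps experiment results onto SHIP / ITERATE / INVESTIGATE / EXTEND."""
    if not guardrails_ok:
        return "INVESTIGATE"   # a guardrail metric was violated
    if not sample_size_reached:
        return "EXTEND"        # keep running, up to the 28-day cap
    if p_value < alpha and effect >= mde:
        # AI-specific rule: significance alone is not enough to ship;
        # the manual review of 50 outputs must also have passed
        return "SHIP" if pm_review_passed else "INVESTIGATE"
    if effect > 0:
        return "ITERATE"       # right direction, below MDE
    return "DO NOT SHIP"       # assumed fallback for a null or negative effect
```

Encoding the framework this way forces every branch to be agreed before results arrive, which is the point of Step 5.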

Quality check before delivering

  • Hypothesis is testable — not "improve the AI"
  • Assignment unit is user ID — not session ID
  • Sample size includes the AI variance multiplier
  • Guardrail metrics defined before the experiment runs
  • PM manual output review required before shipping

Suggested next step: Define the decision criteria before looking at results — not after.