Automated QA for AI Features
Testing AI features is not the same as testing deterministic software. An AI feature can pass all its unit tests and still produce wrong outputs in production — because the model is non-deterministic, because prompts drift, because input distributions shift. The automated QA pipeline for an AI feature must test both the code layer (deterministic) and the AI output layer (probabilistic). This skill defines the full pipeline.
---
Context
The three testing layers for AI features:

| Layer | What it tests | Tools |
|---|---|---|
| Code layer | The software that calls the AI — routing, parsing, error handling, tool calls | Standard unit/integration tests |
| Contract layer | Does the AI response match the expected schema and format? | Schema validation, output parsers |
| Quality layer | Is the AI output actually good? | Eval framework, human spot-check, regression suite |
You cannot assert `expect(aiOutput).toBe("correct answer")` because the AI output varies from run to run, even for the same input.
QA for AI features therefore requires three kinds of assertion: schema assertions (deterministic), content assertions (probabilistic), and eval assertions (run against a labelled test set).
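
As an illustration of a content assertion, the sketch below checks properties that any acceptable answer should have instead of matching exact wording. `ContentRule`, `checkContent`, and the refund-policy example are hypothetical names invented for this sketch, not part of any test framework.

```typescript
// Hypothetical content assertion: check properties of the output, not its exact wording.
interface ContentRule {
  mustInclude: string[];    // phrases every acceptable answer should contain
  mustNotInclude: string[]; // phrases that should never appear
  maxLength: number;        // upper bound on output length, in characters
}

interface ContentResult {
  passed: boolean;
  failures: string[];
}

function checkContent(output: string, rule: ContentRule): ContentResult {
  const text = output.toLowerCase();
  const failures: string[] = [];
  for (const phrase of rule.mustInclude) {
    if (!text.includes(phrase.toLowerCase())) failures.push("missing: " + phrase);
  }
  for (const phrase of rule.mustNotInclude) {
    if (text.includes(phrase.toLowerCase())) failures.push("forbidden: " + phrase);
  }
  if (output.length > rule.maxLength) failures.push("too long: " + output.length);
  return { passed: failures.length === 0, failures };
}

// Two differently worded outputs can satisfy the same rule; a third violates it.
const refundRule: ContentRule = {
  mustInclude: ["30 days"],
  mustNotInclude: ["guarantee"],
  maxLength: 200,
};
const firstWording = checkContent("Refunds are available within 30 days of purchase.", refundRule);
const secondWording = checkContent("You have 30 days to request a refund.", refundRule);
const badWording = checkContent("We guarantee a refund at any time.", refundRule);
```

Keyword rules like these are cheap and deterministic, but they only approximate quality. They sit between schema checks and the eval suite, not in place of either.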
---
Step 1 — Define the testing context
Ask:
Step 2 — Define the code layer tests
Unit tests with mocked AI responses covering: input validation, output parsing, retry logic, rate limit handling, timeout handling, and token limits.
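
A minimal sketch of the output-parsing part of this step, assuming the feature expects a JSON object back from the model. `parseModelReply` and the mocked replies are hypothetical; the point is that the replies are fixed strings, so the checks stay deterministic and need no model call.

```typescript
// Hypothetical parser under test: extracts a JSON object from a raw model reply.
type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

function parseModelReply(raw: string): ParseResult {
  // Models sometimes wrap JSON in prose or a code fence, so take the outermost braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { ok: false, error: "no JSON object found" };
  }
  try {
    return { ok: true, value: JSON.parse(raw.slice(start, end + 1)) };
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
}

// Mocked AI responses stand in for the real model.
const mockedReplies = {
  clean: '{"sentiment": "positive"}',
  wrapped: 'Here is the result: {"sentiment": "negative"} Hope that helps.',
  truncated: '{"sentiment": "posi',
  malformed: "{sentiment: positive}",
  empty: "",
};

const cleanResult = parseModelReply(mockedReplies.clean);
const wrappedResult = parseModelReply(mockedReplies.wrapped);
const truncatedResult = parseModelReply(mockedReplies.truncated);
const malformedResult = parseModelReply(mockedReplies.malformed);
const emptyResult = parseModelReply(mockedReplies.empty);
```

Retry, rate-limit, and timeout handling can be tested the same way, by injecting a mocked client that fails a scripted number of times before succeeding.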
Step 3 — Define the contract layer tests
Schema validation on every AI output. Schema violations trigger graceful errors and engineering alerts.
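
The sketch below shows one way the contract check and its failure path could fit together. The `TriageOutput` contract, the fallback message, and the in-memory `alerts` array are all assumptions made for illustration. A real project would more likely use a schema library and a real alerting channel; the hand-written validator only keeps the example free of dependencies.

```typescript
// Hypothetical output contract for a ticket-triage feature.
interface TriageOutput {
  category: "billing" | "technical" | "other";
  confidence: number; // expected in [0, 1]
}

type Validation =
  | { valid: true; data: TriageOutput }
  | { valid: false; violations: string[] };

const CATEGORIES: string[] = ["billing", "technical", "other"];

function validateTriage(raw: unknown): Validation {
  if (typeof raw !== "object" || raw === null) {
    return { valid: false, violations: ["output is not an object"] };
  }
  const obj = raw as Record<string, unknown>;
  const category = obj["category"];
  const confidence = obj["confidence"];
  const violations: string[] = [];
  if (typeof category !== "string" || !CATEGORIES.includes(category)) {
    violations.push("category must be one of: " + CATEGORIES.join(", "));
  }
  // The negated range check also rejects NaN.
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    violations.push("confidence must be a number in [0, 1]");
  }
  if (violations.length > 0) return { valid: false, violations };
  return {
    valid: true,
    data: { category: category as TriageOutput["category"], confidence: confidence as number },
  };
}

// In-memory stand-in for an engineering alert channel (assumption).
const alerts: string[] = [];

// Graceful error: the user gets a fallback message, engineering gets an alert.
function handleOutput(raw: unknown): string {
  const result = validateTriage(raw);
  if (!result.valid) {
    alerts.push("schema violation: " + result.violations.join("; "));
    return "Sorry, we could not process that request. Please try again.";
  }
  return "Routed to " + result.data.category;
}

const goodReply = handleOutput({ category: "billing", confidence: 0.92 });
const badReply = handleOutput({ category: "refunds", confidence: 1.4 });
```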
Step 4 — Define the quality layer tests (eval tests)
Functional evals, adversarial evals, and regression evals. Run on prompt changes (blocking), model changes (blocking), and weekly production samples.
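
A blocking eval can be reduced to a pass rate plus a gate. The sketch below assumes classification-style labels that can be compared by exact match after normalisation; free-text outputs would need a scoring function instead. `runEval`, `gate`, the thresholds, and the keyword stub standing in for the model are all hypothetical, and the model is treated as a synchronous function for brevity.

```typescript
// One labelled example: an input plus the label a correct output must match.
interface EvalCase {
  input: string;
  expected: string;
}

interface EvalReport {
  total: number;
  passed: number;
  passRate: number;
  failures: EvalCase[];
}

// `model` is any function from input to output; in CI it would wrap the real prompt and model.
function runEval(cases: EvalCase[], model: (input: string) => string): EvalReport {
  // Normalise case and whitespace before comparing against the label.
  const failures = cases.filter((c) => model(c.input).trim().toLowerCase() !== c.expected);
  const passed = cases.length - failures.length;
  const passRate = cases.length === 0 ? 0 : passed / cases.length;
  return { total: cases.length, passed, passRate, failures };
}

// Blocking gate: require an absolute minimum (functional) and no drop below the
// stored baseline from the previous prompt or model version (regression).
function gate(report: EvalReport, minPassRate: number, baselinePassRate: number): boolean {
  return report.passRate >= minPassRate && report.passRate >= baselinePassRate;
}

const labelledSet: EvalCase[] = [
  { input: "I love this product", expected: "positive" },
  { input: "This is terrible", expected: "negative" },
  { input: "Great support team", expected: "positive" },
  { input: "Not great, would not buy again", expected: "negative" },
];

// Keyword stub standing in for the model under test; it misreads the negated case.
const stubModel = (input: string): string =>
  /love|great/i.test(input) ? "positive" : "negative";

const evalReport = runEval(labelledSet, stubModel);
const mayMerge = gate(evalReport, 0.9, 0.75); // false: 3 of 4 cases pass, below the 0.9 minimum
```

Adversarial evals fit the same shape: they are additional labelled cases whose inputs are chosen to provoke failures, such as the negated sentence above.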
Step 5 — Define agent-specific test requirements
Tool call tests, loop tests, guardrail tests, and multi-step scenario tests.
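
These four test types can share one harness if the agent's decision logic is injected, so each test scripts the behaviour it wants to check. The sketch below is a toy under that assumption: `runAgent`, the `Policy` type, and the tool names are hypothetical, and tools are recorded in a trace without being executed.

```typescript
// A tool call the agent wants to make.
interface ToolCall {
  tool: string;
  args: Record<string, string>;
}

// Scripted stand-in for the model: returns the next call, or null when finished.
type Policy = (trace: ToolCall[]) => ToolCall | null;

interface AgentRun {
  trace: ToolCall[];
  stoppedBy: "done" | "max_steps" | "guardrail";
}

const ALLOWED_TOOLS: string[] = ["search_orders", "send_reply"];

function runAgent(policy: Policy, maxSteps: number): AgentRun {
  const trace: ToolCall[] = [];
  while (trace.length < maxSteps) {
    const next = policy(trace);
    if (next === null) return { trace, stoppedBy: "done" };
    // Guardrail: refuse any tool outside the allow-list before recording it.
    if (!ALLOWED_TOOLS.includes(next.tool)) return { trace, stoppedBy: "guardrail" };
    trace.push(next);
  }
  // Loop protection: the step budget ran out before the policy finished.
  return { trace, stoppedBy: "max_steps" };
}

// Multi-step scenario: look up the order, then reply, then stop.
const scriptedPolicy: Policy = (trace) => {
  if (trace.length === 0) return { tool: "search_orders", args: { orderId: "A1" } };
  if (trace.length === 1) return { tool: "send_reply", args: { text: "Your order shipped." } };
  return null;
};

// Loop scenario: a policy that never finishes must be cut off by the step limit.
const loopingPolicy: Policy = () => ({ tool: "search_orders", args: { orderId: "A1" } });

// Guardrail scenario: a disallowed tool must be blocked.
const roguePolicy: Policy = () => ({ tool: "delete_account", args: {} });

const scenarioRun = runAgent(scriptedPolicy, 5);
const loopRun = runAgent(loopingPolicy, 5);
const guardrailRun = runAgent(roguePolicy, 5);
```

Asserting on the trace (which tools, in which order, with which arguments) covers tool call and multi-step tests; asserting on `stoppedBy` covers loop and guardrail tests.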
Step 6 — Define the CI/CD pipeline structure
On every commit: lint, unit tests, integration tests. On prompt/model change: full eval suite (blocking). Weekly: production eval sample and drift monitoring.
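
The trigger-to-stage mapping above can be expressed as data, which makes it testable independently of any CI system. This sketch makes several assumptions the text does not state: per-commit stages are treated as blocking, prompt and model changes also run the per-commit stages, weekly jobs are non-blocking, and the `prompts/` directory and `model.config.json` file are hypothetical locations used to detect what changed.

```typescript
type Trigger = "commit" | "prompt_change" | "model_change" | "weekly";

interface Stage {
  name: string;
  blocking: boolean;
}

// Classify a change set by the files it touches (paths are hypothetical).
function triggerFor(changedFiles: string[]): Trigger {
  if (changedFiles.some((f) => f.startsWith("prompts/"))) return "prompt_change";
  if (changedFiles.some((f) => f === "model.config.json")) return "model_change";
  return "commit";
}

function stagesFor(trigger: Trigger): Stage[] {
  const perCommit: Stage[] = [
    { name: "lint", blocking: true },
    { name: "unit tests", blocking: true },
    { name: "integration tests", blocking: true },
  ];
  switch (trigger) {
    case "commit":
      return perCommit;
    case "prompt_change":
    case "model_change":
      // The full eval suite must pass before the change can merge.
      return [...perCommit, { name: "full eval suite", blocking: true }];
    case "weekly":
      // Scheduled jobs report on production behaviour; there is no merge to block.
      return [
        { name: "production eval sample", blocking: false },
        { name: "drift monitoring", blocking: false },
      ];
  }
}

const codeOnlyStages = stagesFor(triggerFor(["src/router.ts"]));
const promptStages = stagesFor(triggerFor(["src/router.ts", "prompts/triage.md"]));
const weeklyStages = stagesFor("weekly");
```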