Automated QA for AI Features

Testing AI features is not the same as testing deterministic software. An AI feature can pass all its unit tests and still produce wrong outputs in production — because the model is non-deterministic, because prompts drift, because input distributions shift. The automated QA pipeline for an AI feature must test both the code layer (deterministic) and the AI output layer (probabilistic). This skill defines the full pipeline.

---

Context

The three testing layers for AI features:

Layer	What it tests	Tools
Code layer	The software that calls the AI — routing, parsing, error handling, tool calls	Standard unit/integration tests
Contract layer	Does the AI response match the expected schema and format?	Schema validation, output parsers
Quality layer	Is the AI output actually good?	Eval framework, human spot-check, regression suite

The core challenge:

You cannot assert `expect(aiOutput).toBe("correct answer")` because the AI output varies.

QA for AI features requires: schema assertions (deterministic), content assertions (probabilistic), and eval assertions (run against a labelled test set).
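To make the difference between an exact-match assertion and a content assertion concrete, here is a minimal sketch. The two strings are invented stand-ins for two runs of the same prompt, and `mentionsAll` and `withinLength` are hypothetical helpers, not part of any testing framework.

```typescript
// Two hypothetical outputs from the same prompt on different runs.
const runA: string = "The refund was issued on 3 March and should arrive within 5 days.";
const runB: string = "A refund went out on March 3; expect it in about five days.";

// Content assertion helper: every required term appears (case-insensitive).
function mentionsAll(text: string, terms: string[]): boolean {
  const lower = text.toLowerCase();
  return terms.every((term) => lower.indexOf(term.toLowerCase()) !== -1);
}

// Content assertion helper: the output stays within a length budget.
function withinLength(text: string, maxChars: number): boolean {
  return text.length <= maxChars;
}

// An exact-match assertion fails even though both answers are acceptable.
const exactMatch = runA === runB;

// Property-style assertions hold for both runs.
const bothAcceptable = [runA, runB].every(
  (output) => mentionsAll(output, ["refund", "march"]) && withinLength(output, 120)
);
```

Here `exactMatch` is false while `bothAcceptable` is true, which is the reason content assertions check properties of the output rather than its exact wording.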

---

Step 1 — Define the testing context

Ask:

What AI feature is being tested?

What is the primary output type? (Free text / structured JSON / classification / tool call)

What is the call architecture? (Direct API call / RAG pipeline / Agent / Multi-agent)

What is the stakes level? (Low / medium / high, judged by the cost of a wrong output)

What CI/CD system does the team use?

What testing frameworks are in use?
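One way to keep the answers usable by later steps is to record them as a typed object. The sketch below is only an illustration: every field name, every allowed value, and the example feature are assumptions, not something this skill prescribes.

```typescript
// All field names and allowed values are illustrative assumptions.
type OutputType = "free_text" | "structured_json" | "classification" | "tool_call";
type CallArchitecture = "direct_api" | "rag_pipeline" | "agent" | "multi_agent";
type StakesLevel = "low" | "medium" | "high";

interface TestingContext {
  feature: string;
  outputType: OutputType;
  architecture: CallArchitecture;
  stakes: StakesLevel;
  ciSystem: string;
  testFrameworks: string[];
}

// Example answers for a hypothetical support-ticket summarizer.
const ticketSummarizerContext: TestingContext = {
  feature: "support ticket summarizer",
  outputType: "structured_json",
  architecture: "direct_api",
  stakes: "medium",
  ciSystem: "GitHub Actions",
  testFrameworks: ["Jest"],
};

// Later steps can branch on the answers: agent tests only apply to agentic features.
function needsAgentTests(ctx: TestingContext): boolean {
  return ctx.architecture === "agent" || ctx.architecture === "multi_agent";
}
```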

Step 2 — Define the code layer tests

Write unit tests against mocked AI responses. Cover input validation, output parsing, retry logic, rate limit handling, timeout handling, and token limits.
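As a sketch of a code layer test, the example below exercises retry logic against a scripted fake client instead of a real model. The names `AiResult`, `callWithRetry`, and `scriptedClient` are hypothetical, and the client is synchronous only to keep the sketch short; a real client call would return a Promise.

```typescript
// Possible outcomes of one call to the AI provider (names are illustrative).
type AiResult =
  | { kind: "ok"; body: string }
  | { kind: "rate_limited" }
  | { kind: "timeout" };

// Synchronous for brevity; a real client would return a Promise.
type AiClient = (prompt: string) => AiResult;

// Code under test: retry transient failures up to maxAttempts times.
function callWithRetry(client: AiClient, prompt: string, maxAttempts: number): AiResult {
  let last: AiResult = { kind: "timeout" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = client(prompt);
    if (last.kind === "ok") {
      return last;
    }
  }
  return last; // still failing after the final attempt
}

// Test double: replays a fixed script of results and counts calls.
// Once the script runs out, the last entry repeats.
function scriptedClient(script: AiResult[]): { client: AiClient; callCount: () => number } {
  let calls = 0;
  const client: AiClient = () => {
    const next = script[Math.min(calls, script.length - 1)];
    calls += 1;
    return next !== undefined ? next : { kind: "timeout" };
  };
  return { client, callCount: () => calls };
}

// Scenario: two transient failures, then success on the third attempt.
const flaky = scriptedClient([
  { kind: "rate_limited" },
  { kind: "timeout" },
  { kind: "ok", body: '{"label":"refund"}' },
]);
const recovered = callWithRetry(flaky.client, "classify this ticket", 3);
```

Because the fake client is deterministic, the test can assert both the final result and the exact number of attempts, which a live model call would not allow.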

Step 3 — Define the contract layer tests

Run schema validation on every AI output. A schema violation should produce a graceful error for the user and an alert for engineering.
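A minimal sketch of a contract layer check follows, written by hand so that it has no dependencies; a schema library would normally replace `validateTicketSummary`. The `TicketSummary` shape, the fallback message, and the `alertEngineering` callback are all assumptions made for illustration.

```typescript
// Hypothetical expected shape of the AI output.
interface TicketSummary {
  summary: string;
  priority: "low" | "medium" | "high";
}

type Validation =
  | { ok: true; value: TicketSummary }
  | { ok: false; errors: string[] };

// Parse the raw model output and check it against the expected shape.
function validateTicketSummary(raw: string): Validation {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["response is not valid JSON"] };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, errors: ["response is not a JSON object"] };
  }
  const record = data as Record<string, unknown>;
  const summary = record["summary"];
  const priority = record["priority"];
  const errors: string[] = [];
  if (typeof summary !== "string" || summary.length === 0) {
    errors.push("summary must be a non-empty string");
  }
  if (priority !== "low" && priority !== "medium" && priority !== "high") {
    errors.push("priority must be low, medium, or high");
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    value: { summary: summary as string, priority: priority as TicketSummary["priority"] },
  };
}

// On a violation: alert engineering and return a graceful message to the user.
function handleAiOutput(raw: string, alertEngineering: (message: string) => void): string {
  const result = validateTicketSummary(raw);
  if (result.ok) {
    return result.value.summary;
  }
  alertEngineering("schema violation: " + result.errors.join("; "));
  return "Sorry, we could not summarize this ticket. Please try again.";
}
```

Passing the alert function in as a parameter keeps the handler testable: a test can supply a function that records messages and then assert that a malformed response raised exactly one alert.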

Step 4 — Define the quality layer tests (eval tests)

Define functional evals, adversarial evals, and regression evals, each with an explicit pass threshold. Run them on every prompt change (blocking), on every model change (blocking), and weekly against a sample of production traffic.
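The sketch below shows one possible shape for an eval with a pass threshold: run every labelled case, compute the pass rate, and compare it to the threshold. The labelled set, the 0.9 threshold, and `keywordStub` (a stand-in for the real model call) are all invented for illustration.

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

type Classifier = (input: string) => string;

interface EvalReport {
  passRate: number;
  passed: boolean;
  failures: EvalCase[];
}

// Run every labelled case and compare the pass rate to the threshold.
function runEval(cases: EvalCase[], classify: Classifier, threshold: number): EvalReport {
  const failures = cases.filter((c) => classify(c.input) !== c.expected);
  const passRate = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { passRate, passed: passRate >= threshold, failures };
}

// Tiny hypothetical labelled set; a real one would be far larger.
const LABELLED_SET: EvalCase[] = [
  { input: "I want my money back", expected: "refund" },
  { input: "The app crashes on login", expected: "bug" },
  { input: "How do I export my data?", expected: "how_to" },
  { input: "I was charged twice this month", expected: "refund" },
];

// Stand-in for the model under test, so the sketch runs without an API key.
function keywordStub(input: string): string {
  const text = input.toLowerCase();
  if (text.indexOf("money") !== -1) return "refund";
  if (text.indexOf("crash") !== -1) return "bug";
  return "how_to";
}

// The stub mislabels the fourth case, so the pass rate is 0.75 and the gate fails at 0.9.
const evalReport = runEval(LABELLED_SET, keywordStub, 0.9);
```

In a blocking CI gate, a report with `passed` set to false would fail the build, and the `failures` list gives the reviewer the exact cases to inspect.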

Step 5 — Define agent-specific test requirements

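Tool call tests (the agent picks the expected tool and arguments), loop tests (the agent stops within a step limit), guardrail tests (disallowed actions are refused), and multi-step scenario tests (end-to-end task runs).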
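As an illustration of loop and guardrail tests, the sketch below drives a toy agent runner with scripted policies that stand in for the model. `runAgent`, the `Policy` type, the tool names, and the step limit are hypothetical; the point is only that a runaway loop and a disallowed tool call each end in a distinct, assertable status.

```typescript
type AgentAction =
  | { kind: "tool_call"; tool: string }
  | { kind: "final"; answer: string };

// Stand-in for the model: decides the next action from the step index.
type Policy = (step: number) => AgentAction;

type RunResult =
  | { status: "completed"; answer: string; toolCalls: string[] }
  | { status: "step_limit"; toolCalls: string[] }
  | { status: "blocked_tool"; tool: string; toolCalls: string[] };

// Runner under test: enforces a tool allowlist and a maximum step count.
function runAgent(policy: Policy, allowedTools: string[], maxSteps: number): RunResult {
  const toolCalls: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = policy(step);
    if (action.kind === "final") {
      return { status: "completed", answer: action.answer, toolCalls };
    }
    if (allowedTools.indexOf(action.tool) === -1) {
      return { status: "blocked_tool", tool: action.tool, toolCalls };
    }
    toolCalls.push(action.tool);
  }
  return { status: "step_limit", toolCalls };
}

// Loop test input: a policy that never finishes.
const loopingPolicy: Policy = () => ({ kind: "tool_call", tool: "search" });

// Multi-step scenario input: two searches, then a final answer.
const wellBehavedPolicy: Policy = (step) =>
  step < 2 ? { kind: "tool_call", tool: "search" } : { kind: "final", answer: "done" };

// Guardrail test input: a policy that requests a tool outside the allowlist.
const roguePolicy: Policy = () => ({ kind: "tool_call", tool: "delete_account" });
```

With an allowlist of `["search"]` and a limit of five steps, the three policies end in `step_limit`, `completed`, and `blocked_tool` respectively.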

Step 6 — Define the CI/CD pipeline structure

On every commit: lint, unit tests, integration tests. On prompt/model change: full eval suite (blocking). Weekly: production eval sample and drift monitoring.
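The pipeline structure can be sketched as a small planning function that maps a trigger to the stages to run. This is not a real CI configuration: the path prefixes in `EVAL_TRIGGER_PREFIXES` are hypothetical, and treating the per-commit stages as blocking is an assumption, since only the eval suite is stated to be a blocking gate.

```typescript
type Trigger =
  | { kind: "commit"; changedPaths: string[] }
  | { kind: "weekly" };

interface Stage {
  name: string;
  blocking: boolean;
}

// Hypothetical locations of prompt templates and model configuration.
const EVAL_TRIGGER_PREFIXES: string[] = ["prompts/", "config/model"];

// True when any changed path starts with one of the eval-triggering prefixes.
function touchesPromptOrModel(paths: string[]): boolean {
  return paths.some((path) =>
    EVAL_TRIGGER_PREFIXES.some((prefix) => path.indexOf(prefix) === 0)
  );
}

function planStages(trigger: Trigger): Stage[] {
  if (trigger.kind === "weekly") {
    return [
      { name: "production eval sample", blocking: false },
      { name: "drift monitoring", blocking: false },
    ];
  }
  // Assumption: per-commit stages block the merge.
  const stages: Stage[] = [
    { name: "lint", blocking: true },
    { name: "unit tests", blocking: true },
    { name: "integration tests", blocking: true },
  ];
  // A prompt or model change adds the full eval suite as a blocking gate.
  if (touchesPromptOrModel(trigger.changedPaths)) {
    stages.push({ name: "full eval suite", blocking: true });
  }
  return stages;
}
```

Keeping the eval suite off ordinary commits keeps the fast path fast, while any change under the prompt or model paths pays for the full run before it can merge.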

Quality check before delivering

All three testing layers are covered — code, contract, and quality

Eval tests are defined with explicit pass thresholds

Prompt changes are a BLOCKING CI gate

Mock response library covers failure modes

Agent-specific tests are included if the feature has agentic behaviour

Rollback trigger is defined with a specific error rate threshold
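A rollback trigger with a specific error rate threshold could look like the sketch below. The 5% threshold, the 200-request minimum, and the choice of what counts as an error (for example, schema violations plus failed calls) are placeholder assumptions that each team would set for itself.

```typescript
interface WindowStats {
  requests: number; // requests seen in the monitoring window
  errors: number;   // requests counted as errors in the same window
}

interface RollbackPolicy {
  maxErrorRate: number; // roll back when the observed rate exceeds this
  minRequests: number;  // ignore windows too small to be meaningful
}

function errorRate(stats: WindowStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

function shouldRollback(stats: WindowStats, policy: RollbackPolicy): boolean {
  if (stats.requests < policy.minRequests) {
    return false; // not enough traffic to judge
  }
  return errorRate(stats) > policy.maxErrorRate;
}

// Placeholder values for illustration only.
const rollbackPolicy: RollbackPolicy = { maxErrorRate: 0.05, minRequests: 200 };
```

The minimum request count guards against rolling back on a handful of early failures right after a deploy.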

Suggested next step: Build the mock response library before writing the first test. Ten realistic mock responses will make every test meaningful and the suite maintainable.
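A mock response library can start as a plain list of named raw responses, one per failure mode, plus a check that no mode is missing. The ten entries below are invented examples for a hypothetical ticket summarizer, and the `FailureMode` names are assumptions rather than a standard taxonomy.

```typescript
type FailureMode =
  | "none" | "malformed_json" | "truncated" | "empty" | "prose_wrapped"
  | "wrong_type" | "missing_field" | "invalid_enum" | "refusal" | "extra_field";

interface MockResponse {
  name: string;
  failureMode: FailureMode;
  raw: string; // exactly what the model client would hand back
}

// Modes the library is expected to cover.
const REQUIRED_MODES: FailureMode[] = [
  "none", "malformed_json", "truncated", "empty", "prose_wrapped",
  "wrong_type", "missing_field", "invalid_enum", "refusal", "extra_field",
];

const MOCK_LIBRARY: MockResponse[] = [
  { name: "valid", failureMode: "none", raw: '{"summary":"Customer requests a refund.","priority":"high"}' },
  { name: "malformed", failureMode: "malformed_json", raw: '{"summary": "Refund requested", "priority": }' },
  { name: "truncated", failureMode: "truncated", raw: '{"summary":"Customer reports that the app cra' },
  { name: "empty", failureMode: "empty", raw: "" },
  { name: "prose_wrapped", failureMode: "prose_wrapped", raw: 'Sure! Here is the JSON: {"summary":"Login issue","priority":"low"}' },
  { name: "wrong_type", failureMode: "wrong_type", raw: '{"summary":42,"priority":"low"}' },
  { name: "missing_field", failureMode: "missing_field", raw: '{"summary":"Refund requested"}' },
  { name: "invalid_enum", failureMode: "invalid_enum", raw: '{"summary":"Refund requested","priority":"urgent"}' },
  { name: "refusal", failureMode: "refusal", raw: "I'm sorry, but I can't help with that request." },
  { name: "extra_field", failureMode: "extra_field", raw: '{"summary":"Refund requested","priority":"low","confidence":0.93}' },
];

// Returns the required failure modes that no mock in the library exercises.
function uncoveredModes(library: MockResponse[]): FailureMode[] {
  return REQUIRED_MODES.filter(
    (mode) => !library.some((mock) => mock.failureMode === mode)
  );
}

// Look up a mock by name so tests read as getMock("truncated").raw.
function getMock(name: string): MockResponse {
  const found = MOCK_LIBRARY.filter((mock) => mock.name === name)[0];
  if (found === undefined) {
    throw new Error("unknown mock: " + name);
  }
  return found;
}
```

Running `uncoveredModes(MOCK_LIBRARY)` as a test of its own turns the checklist item about failure mode coverage into something CI can enforce.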