
Automated QA for AI Features

Testing AI features is not the same as testing deterministic software. An AI feature can pass all its unit tests and still produce wrong outputs in production — because the model is non-deterministic, because prompts drift, because input distributions shift. The automated QA pipeline for an AI feature must test both the code layer (deterministic) and the AI output layer (probabilistic). This skill defines the full pipeline.

---

Context

The three testing layers for AI features:
| Layer | What it tests | Tools |
| --- | --- | --- |
| Code layer | The software that calls the AI — routing, parsing, error handling, tool calls | Standard unit/integration tests |
| Contract layer | Does the AI response match the expected schema and format? | Schema validation, output parsers |
| Quality layer | Is the AI output actually good? | Eval framework, human spot-check, regression suite |
The core challenge:

You cannot assert `expect(aiOutput).toBe("correct answer")` because the AI output varies.

QA for AI features requires: schema assertions (deterministic), content assertions (probabilistic), and eval assertions (run against a labelled test set).
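As a minimal sketch of the first two assertion types (the `AiAnswer` shape and both helper names are hypothetical, not a real API), a schema assertion is deterministic while a content assertion checks properties of the output rather than exact strings:

```typescript
// Hypothetical AI output shape; the field names are assumptions for illustration.
interface AiAnswer {
  answer: string;
  confidence: number;
}

// Schema assertion: deterministic. The shape either matches or it does not.
function assertSchema(raw: unknown): AiAnswer {
  const obj = raw as Record<string, unknown>;
  if (typeof obj?.answer !== "string" || typeof obj?.confidence !== "number") {
    throw new Error("Schema violation: expected { answer: string, confidence: number }");
  }
  return obj as unknown as AiAnswer;
}

// Content assertion: probabilistic. Check properties, never exact wording.
function assertContent(out: AiAnswer): void {
  if (out.answer.trim().length === 0) throw new Error("Empty answer");
  if (out.confidence < 0 || out.confidence > 1) throw new Error("Confidence out of range");
}

const sample = { answer: "Paris is the capital of France.", confidence: 0.92 };
assertContent(assertSchema(sample)); // passes without pinning the exact string
```

Eval assertions, the third type, aggregate many such content checks over a labelled test set (see Step 4).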

---

Step 1 — Define the testing context

Ask:

  • What AI feature is being tested?
  • What is the primary output type?
  • What is the call architecture? (Direct API call / RAG pipeline / Agent / Multi-agent)
  • What is the stakes level?
  • What CI/CD system does the team use?
  • What testing frameworks are in use?
Step 2 — Define the code layer tests

Unit tests with mocked AI responses covering: input validation, output parsing, retry logic, rate limit handling, timeout handling, and token limits.
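A sketch of what a code-layer test looks like in practice, assuming a minimal `AiClient` interface of our own invention: the retry logic is exercised against a mock that simulates rate-limit failures, so no real model is ever called.

```typescript
// Minimal client interface assumed for illustration; real SDKs differ.
type AiClient = { complete(prompt: string): Promise<string> };

async function completeWithRetry(client: AiClient, prompt: string, maxRetries = 3): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.complete(prompt);
    } catch (err) {
      lastError = err; // e.g. rate limit or timeout; production code would back off here
    }
  }
  throw lastError;
}

// Mock that fails twice before succeeding. This is the kind of entry a
// mock response library holds: one canned behaviour per failure mode.
function flakyMock(): AiClient {
  let calls = 0;
  return {
    async complete() {
      calls++;
      if (calls < 3) throw new Error("429 rate limited");
      return "ok";
    },
  };
}

completeWithRetry(flakyMock(), "hello").then((result) => {
  if (result !== "ok") throw new Error("retry logic failed");
});
```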

Step 3 — Define the contract layer tests

Schema validation on every AI output. Schema violations trigger graceful errors and engineering alerts.
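One way this can look, as a sketch (the `summary` field, `alertEngineering`, and the fallback message are all hypothetical): on violation the user gets a graceful fallback, never a crash, while engineering is alerted.

```typescript
// Hypothetical fallback shown to the user when the AI output breaks contract.
const FALLBACK_MESSAGE = "Sorry, something went wrong. Please try again.";

function alertEngineering(detail: string): void {
  // Stand-in: production code would page or log to a monitoring system.
  console.error(`[contract-violation] ${detail}`);
}

function handleAiResponse(raw: string): string {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.summary !== "string") {
      throw new Error("missing 'summary' field");
    }
    return parsed.summary;
  } catch (err) {
    alertEngineering(String(err)); // engineering alert
    return FALLBACK_MESSAGE;       // graceful error for the user
  }
}
```

In a real codebase a schema library would typically replace the hand-written check, but the control flow (validate, degrade, alert) is the contract layer's job either way.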

Step 4 — Define the quality layer tests (eval tests)

Functional evals, adversarial evals, and regression evals. Run on prompt changes (blocking), model changes (blocking), and weekly production samples.
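The shape of an eval runner with an explicit pass threshold might be sketched like this (the `EvalCase` type and the 90% threshold are assumptions; each team sets its own):

```typescript
// One labelled case: an input plus a grader that judges the model's output.
interface EvalCase {
  input: string;
  grade: (output: string) => boolean;
}

// Returns true if the pass rate clears the threshold; a false return
// is what makes the CI gate blocking for prompt and model changes.
function runEvals(
  cases: EvalCase[],
  model: (input: string) => string,
  threshold = 0.9, // assumed value; tune per feature and stakes level
): boolean {
  const passed = cases.filter((c) => c.grade(model(c.input))).length;
  const rate = passed / cases.length;
  console.log(`eval pass rate: ${(rate * 100).toFixed(1)}%`);
  return rate >= threshold;
}
```

Regression evals are the same runner pointed at cases harvested from past production failures, so a fixed bug stays fixed.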

Step 5 — Define agent-specific test requirements

Tool call tests, loop tests, guardrail tests, and multi-step scenario tests.
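A loop test, for example, asserts that the agent terminates within a step budget. A minimal sketch (the `AgentStep` shape and the budget of 10 are assumptions):

```typescript
// One tick of an agent loop; `done` signals termination.
type AgentStep = { done: boolean };

// Runs the agent and returns the number of steps taken, or throws
// if the budget is exceeded: the loop test asserts both behaviours.
function runAgent(step: () => AgentStep, maxSteps = 10): number {
  for (let i = 1; i <= maxSteps; i++) {
    if (step().done) return i;
  }
  throw new Error(`Agent exceeded ${maxSteps} steps: possible infinite loop`);
}
```

Guardrail tests follow the same pattern: drive the agent into a forbidden state with a mocked tool result and assert that the guardrail, not the agent, decides what happens next.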

Step 6 — Define the CI/CD pipeline structure

On every commit: lint, unit tests, integration tests. On prompt/model change: full eval suite (blocking). Weekly: production eval sample and drift monitoring.
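The gating rule can be expressed as a small function that any CI system can call; the path conventions (`prompts/`, `model-config`) and suite names here are illustrative assumptions, not a standard.

```typescript
// Maps the changed files in a commit to the CI suites that must run.
function suitesFor(changedFiles: string[]): string[] {
  const suites = ["lint", "unit", "integration"]; // every commit
  const touchesAi = changedFiles.some(
    (f) => f.startsWith("prompts/") || f.includes("model-config"),
  );
  if (touchesAi) suites.push("full-eval"); // blocking gate on prompt/model changes
  return suites;
}
```

The weekly production sample runs on a schedule rather than on commits, so it lives in a cron-triggered job outside this function.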

Quality check before delivering

  • All three testing layers are covered — code, contract, and quality
  • Eval tests are defined with explicit pass thresholds
  • Prompt changes are a BLOCKING CI gate
  • Mock response library covers failure modes
  • Agent-specific tests are included if the feature has agentic behaviour
  • Rollback trigger is defined with a specific error rate threshold

Suggested next step: Build the mock response library before writing the first test. Ten realistic mock responses will make every test meaningful and the suite maintainable.