
AI Safety Requirements

AI safety is the set of controls that prevent an AI system from causing unintended harm — to users, third parties, or broader society. For AI PMs, safety is not a philosophical discussion. It's a requirement: what specific controls must be in place before this feature ships, and how do we verify they're working? This skill identifies the safety requirements specific to your feature and turns them into implementable product specifications.

---

Context

AI safety risk categories (know which apply to your feature):
| Risk category | Description | Example |
| --- | --- | --- |
| Direct harm | The AI causes direct physical, psychological, or financial harm to a user | Medical AI gives dangerous advice; financial AI recommends a harmful action |
| Facilitated harm | The AI enables a user to harm others | AI helps generate targeted harassment; AI provides weapons instructions |
| Runaway behaviour | The AI takes unintended autonomous actions with real-world consequences | Agent deletes files the user didn't intend to delete; agent sends emails without permission |
| Manipulation | The AI influences user beliefs or behaviour in ways that undermine their autonomy | AI optimises for engagement in ways that exploit psychological vulnerabilities |
| Systemic harm | Aggregate AI behaviour causes harm at scale even if individual outputs seem benign | AI recommendation system that systematically amplifies polarising content |

The safety design principle:

Safety controls should be designed for the worst realistic user, not the average user.

---

Step 1 — Identify the safety risk profile

Ask:

  • What could this AI feature do that would directly harm a user?
  • What could this AI feature do that would enable a user to harm others?
  • If the AI behaves autonomously — what actions could it take without the user noticing?
  • What is the worst single output this feature could produce?
  • What is the worst aggregate behaviour at scale?
  • Who are the most vulnerable users?
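The answers above can be captured as a lightweight risk profile record that travels with the feature spec. A minimal sketch in Python; the class name and field names are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

# Illustrative risk-profile record; field names are assumptions, not a standard.
@dataclass
class SafetyRiskProfile:
    feature: str
    direct_harms: list[str] = field(default_factory=list)        # harm to the user
    facilitated_harms: list[str] = field(default_factory=list)   # user harming others
    autonomous_actions: list[str] = field(default_factory=list)  # actions without the user noticing
    worst_single_output: str = ""
    worst_aggregate_behaviour: str = ""
    vulnerable_users: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A profile is only reviewable once both worst cases are written down.
        return bool(self.worst_single_output and self.worst_aggregate_behaviour)

profile = SafetyRiskProfile(
    feature="symptom-checker chatbot",
    direct_harms=["dangerous medical advice"],
    worst_single_output="tells a user a serious symptom is harmless",
)
print(profile.is_complete())  # False until worst_aggregate_behaviour is filled in
```

Making `is_complete` depend on both worst cases enforces the quality check later in this skill: a profile that names only the worst single output is not yet reviewable.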

---

Step 2 — Apply the safety control stack

Control 1 — Output content filters

Define the prohibited content categories for your feature's domain and implement multi-layer filtering: a system prompt instruction, automated output validation, and a user reporting channel.
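One way to sketch the output-validation layer (the middle of the three layers just described). The categories and phrase lists below are placeholders, and a production system would typically use a trained classifier rather than keyword matching:

```python
# Minimal output-validation layer: one check per prohibited category.
# Keyword matching is a placeholder; production systems typically use classifiers.
PROHIBITED = {
    "weapons_instructions": ["build a bomb", "synthesise nerve agent"],
    "targeted_harassment": ["doxx", "harassment campaign"],
}

def validate_output(text: str) -> tuple[bool, list[str]]:
    """Return (allowed, violated_categories) for a candidate model output."""
    lowered = text.lower()
    violations = [cat for cat, phrases in PROHIBITED.items()
                  if any(p in lowered for p in phrases)]
    return (not violations, violations)

allowed, cats = validate_output("Here is how to plan a harassment campaign...")
# A blocked output is replaced with a refusal and logged, feeding the
# user-reporting layer with examples the filter missed or caught.
```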

Control 2 — Vulnerable user protections

Identify which vulnerable populations might use this feature and apply appropriate protections, including crisis detection for any consumer-facing feature.
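A minimal sketch of a crisis-detection routing gate. The signal phrases and response text are placeholders; real deployments use trained classifiers and locale-specific crisis resources rather than a hard-coded phrase table:

```python
# Illustrative crisis-detection gate; phrases and response text are placeholders.
CRISIS_SIGNALS = ["want to end my life", "hurt myself", "no reason to live"]

CRISIS_RESPONSE = (
    "It sounds like you're going through something serious. "
    "You can reach trained support through your local crisis hotline."
)

def route_message(user_message: str, normal_handler) -> str:
    """Route a user message to the crisis path or the normal model response."""
    lowered = user_message.lower()
    if any(signal in lowered for signal in CRISIS_SIGNALS):
        # The crisis path overrides the normal model response entirely.
        return CRISIS_RESPONSE
    return normal_handler(user_message)
```

The key design choice is that the gate sits in front of the model: a crisis message never reaches the normal generation path, so a bad model output cannot make the situation worse.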

Control 3 — Agentic safety controls

For agentic features, require irreversibility controls, blast radius limitation, intent verification, and override/abort mechanisms.
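The first three of these controls can be combined into a single action gate that every agent step passes through. The action names, limit, and return values below are illustrative assumptions:

```python
# Illustrative action gate for an agent. Irreversible actions require explicit
# user confirmation; a per-task action cap limits the blast radius.
IRREVERSIBLE = {"delete_file", "send_email", "make_payment"}
MAX_ACTIONS_PER_TASK = 20  # assumed blast-radius limit

def gate_action(action: str, actions_taken: int, user_confirmed: bool) -> str:
    """Return "proceed", "ask_user", or "abort" for the agent's next action."""
    if actions_taken >= MAX_ACTIONS_PER_TASK:
        return "abort"      # blast-radius limit exceeded; stop the whole task
    if action in IRREVERSIBLE and not user_confirmed:
        return "ask_user"   # intent verification before any irreversible step
    return "proceed"
```

The override/abort mechanism is the user-facing counterpart: the same "abort" state must also be reachable from a visible control the user can hit at any time.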

Control 4 — Anti-manipulation controls

The AI must not use urgency tactics, fabricate social proof, form parasocial relationships, or exploit psychological vulnerabilities.
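These prohibitions can be partially verified before launch by scanning sampled outputs for manipulation patterns. The pattern table below is a placeholder for illustration, not an exhaustive taxonomy:

```python
# Illustrative pre-launch audit: scan sampled outputs for manipulation patterns.
MANIPULATION_PATTERNS = {
    "false_urgency": ["act now", "only minutes left"],
    "fabricated_social_proof": ["everyone else is doing"],
    "parasocial_framing": ["i'll be lonely without you"],
}

def flag_manipulation(output: str) -> list[str]:
    """Return the manipulation categories a sampled output matches."""
    lowered = output.lower()
    return [name for name, phrases in MANIPULATION_PATTERNS.items()
            if any(p in lowered for p in phrases)]
```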

Control 5 — Systemic safety monitoring

Run a monthly aggregate behaviour review, misuse pattern detection, a feedback loop audit, and a third-party impact assessment.
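A sketch of how the monthly aggregate review can flag drift: compare each flagged-category rate against the previous month, so harm that only shows up at scale still surfaces even when no single output is bad. The threshold and category names are assumptions:

```python
# Illustrative month-over-month drift check for the aggregate behaviour review.
def category_rates(flagged_counts: dict[str, int], total_outputs: int) -> dict[str, float]:
    """Convert raw flagged-output counts into per-category rates."""
    return {cat: n / total_outputs for cat, n in flagged_counts.items()}

def drifted(current: dict[str, float], previous: dict[str, float],
            threshold: float = 2.0) -> list[str]:
    """Categories whose rate grew more than `threshold`x month over month."""
    return [cat for cat, rate in current.items()
            if previous.get(cat, 0) > 0 and rate / previous[cat] > threshold]
```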

---

Step 3 — Output the AI safety requirements document

Include: the safety risk profile, all applicable safety controls, a pre-launch verification checklist, and the residual risks accepted.

---

Quality check before delivering

  • Risk profile identifies the worst single output AND the worst aggregate behaviour
  • Content filter categories are specific to this feature's domain
  • Crisis detection is included for any consumer-facing feature
  • Agentic safety controls are included if the feature has autonomous behaviour
  • Residual risks are honest — no safety plan eliminates all risk
  • Pre-launch verification is actionable

Suggested next step: Test the crisis detection pathway before any other safety control. It has the highest stakes and the hardest edge cases.