Embedding Model Selection and Specification

Embedding models convert text (or other content) into vectors — the numerical representations that power semantic search, RAG retrieval, and similarity matching. Choosing the wrong model means rebuilding your entire index when you switch. This skill selects the right embedding model for the task and writes the specification that makes the choice stick.

---

Context

What embedding models do:

An embedding model takes a piece of text and outputs a fixed-length array of numbers (a vector) that represents its meaning. Similar texts produce similar vectors.
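The "similar texts produce similar vectors" property is usually measured with cosine similarity. A minimal sketch, using tiny hand-made 4-dimensional vectors as stand-ins for real model output (production models emit hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for what a model would emit for each text.
v_cat    = [0.9, 0.1, 0.0, 0.2]  # "a small cat"
v_kitten = [0.8, 0.2, 0.1, 0.3]  # "a young kitten"
v_stock  = [0.0, 0.9, 0.8, 0.1]  # "quarterly stock report"

# Related texts score higher than unrelated ones.
assert cosine_similarity(v_cat, v_kitten) > cosine_similarity(v_cat, v_stock)
```

The same function works unchanged on real 1536-dimensional vectors; only the inputs change.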

The embedding model decision axes:
  • Quality — How well does it capture semantic similarity for your content?
  • Dimensions — How many numbers in each vector? More dimensions mean more storage and more compute per similarity comparison
  • Deployment — API call vs. self-hosted
  • Cost — Per-token pricing for API models; compute cost for self-hosted
  • Multilingual — Does it support your languages?
  • Domain — General-purpose or domain-optimised?
---

Step 1 — Define the embedding use case

Ask: content type, query pattern, languages, domain, expected volume, and downstream task.
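The Step 1 answers can be captured as a small structured record so the later steps have something concrete to evaluate against. A sketch with illustrative field names and values:

```python
from dataclasses import dataclass

@dataclass
class EmbeddingUseCase:
    """Answers to the Step 1 questions; fields mirror the list above."""
    content_type: str            # e.g. "support articles", "legal contracts"
    query_pattern: str           # e.g. "short natural-language questions"
    languages: list[str]
    domain: str                  # "general" or a specialty such as "biomedical"
    expected_volume_tokens: int  # tokens to embed over the planning horizon
    downstream_task: str         # "RAG retrieval", "dedup", "clustering", ...

# Example use case (all values illustrative).
use_case = EmbeddingUseCase(
    content_type="product documentation",
    query_pattern="short how-to questions",
    languages=["en", "de"],
    domain="general",
    expected_volume_tokens=50_000_000,
    downstream_task="RAG retrieval",
)
```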

Step 2 — Evaluate candidate models

API-based:
  • OpenAI text-embedding-3-small — 1536 dimensions, $0.02 per 1M tokens
  • OpenAI text-embedding-3-large — 3072 dimensions, $0.13 per 1M tokens
  • Cohere Embed v3 — 1024 dimensions, best multilingual among the API options

Open-source:
  • nomic-embed-text-v1.5 — 768 dimensions, long context, free
  • BAAI/bge-m3 — 1024 dimensions, 100+ languages
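The per-token prices above make the 12-month cost estimate (required by the quality check) a short calculation. A sketch, assuming an illustrative initial index of 200M tokens plus 10M new or re-embedded tokens per month:

```python
# Per-1M-token prices for the API models listed above.
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def annual_cost(model: str, initial_tokens: int, monthly_tokens: int) -> float:
    """12-month embedding spend: initial index build plus monthly churn."""
    total_tokens = initial_tokens + 12 * monthly_tokens
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

for model in PRICE_PER_1M:
    cost = annual_cost(model, initial_tokens=200_000_000, monthly_tokens=10_000_000)
    print(f"{model}: ${cost:,.2f} over 12 months")
```

At these assumed volumes (320M tokens total), the gap between the two OpenAI models is a factor of 6.5, which is why the quality-vs-cost tradeoff in Step 3 needs an explicit answer.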

Step 3 — Define the model selection criteria and decision

Primary factors: multilingual requirement, data sovereignty, volume cost, context length needs, and the quality vs. cost tradeoff.
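One way to make the decision explicit and auditable is a weighted decision matrix over the primary factors. A sketch — the weights and 1-to-5 scores below are illustrative placeholders, not recommendations:

```python
# Illustrative weights over the Step 3 factors (must sum to 1.0).
WEIGHTS = {
    "quality": 0.30,
    "multilingual": 0.25,
    "cost": 0.20,
    "context_length": 0.15,
    "sovereignty": 0.10,
}

# Illustrative 1-5 scores per candidate; score these for your own use case.
candidates = {
    "text-embedding-3-small": {"quality": 4, "multilingual": 3, "cost": 5,
                               "context_length": 4, "sovereignty": 2},
    "bge-m3":                 {"quality": 4, "multilingual": 5, "cost": 4,
                               "context_length": 5, "sovereignty": 5},
}

def weighted_score(scores: dict[str, int]) -> float:
    return sum(WEIGHTS[factor] * score for factor, score in scores.items())

ranked = sorted(candidates, key=lambda m: weighted_score(candidates[m]), reverse=True)
print("Selected:", ranked[0])
```

Recording the weights alongside the decision means the choice can be revisited when requirements (e.g. a new language) change.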

Step 4 — Write the embedding model specification

The specification should include:
  • Model name and version (pinned)
  • Input preprocessing
  • Batching configuration
  • Embedding storage schema
  • Similarity metric and thresholds
  • Refresh policy
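A specification with those fields can live as a checked-in config object. A sketch — every value below is an illustrative placeholder, not a recommended setting:

```python
# Embedding model specification, pinned and versioned in the repo.
EMBEDDING_SPEC = {
    "model": "text-embedding-3-small",   # pinned by exact name, not provider
    "model_version": "2024-01-25",       # provider snapshot date, if published
    "dimensions": 1536,                  # must match the vector DB index config
    "preprocessing": {
        "strip_html": True,
        "max_chunk_tokens": 512,
        "chunk_overlap_tokens": 64,
    },
    "batching": {"max_batch_size": 128, "max_retries": 3},
    "storage": {"table": "doc_embeddings", "vector_column": "embedding"},
    "similarity": {"metric": "cosine", "min_score": 0.75},
    "refresh_policy": "re-embed on content edit; full rebuild on model change",
}
```

Keeping `dimensions` in the spec makes the vector-DB mismatch (a common silent failure) checkable at deploy time.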

Quality check before delivering

  • Model is selected by name and version — not just provider
  • Multilingual requirement is answered explicitly
  • 12-month volume cost is calculated
  • Dimension size is recorded (must match vector DB config)
  • Re-embed trigger policy is defined
  • Similarity threshold is defined

Suggested next step: Before building the index, run a small-scale quality evaluation. Take 20 representative content items and 20 queries. Embed both, compute similarity, and manually judge relevance. Catch model-fit issues before you've built the vector database.
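The evaluation loop can be sketched in a few lines. For the sketch to be self-contained, `embed()` below is a trivial bag-of-words stand-in — in the real evaluation, replace it with a call to the candidate model, and use ~20 items and ~20 queries rather than the handful shown:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def embed(text: str) -> Counter:
    # Stand-in for the candidate embedding model: a bag-of-words vector.
    # Swap in the real model's API call when running the evaluation.
    return Counter(text.lower().split())

docs = {  # ~20 representative content items in practice; 3 shown
    "doc1": "how to reset your password",
    "doc2": "billing and invoice history",
    "doc3": "exporting data to csv",
}
queries = ["i forgot my password", "find my invoice history"]

for q in queries:
    qv = embed(q)
    best = max(docs, key=lambda d: cosine(qv, embed(docs[d])))
    print(f"{q!r} -> {best}")  # manually judge each hit for relevance
```

If the manually judged hit rate is poor here, the model (or the chunking) is wrong for the content, and it is far cheaper to learn that now than after the full index is built.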