Embedding Model Selection and Specification

Embedding models convert text (or other content) into vectors — the numerical representations that power semantic search, RAG retrieval, and similarity matching. Choosing the wrong model means rebuilding your entire index when you switch. This skill selects the right embedding model for the task and writes the specification that makes the choice stick.

---

Context

What embedding models do:

An embedding model takes a piece of text and outputs a fixed-length array of numbers (a vector) that represents its meaning. Similar texts produce similar vectors.
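The "similar texts produce similar vectors" property is usually measured with cosine similarity. A minimal sketch, using tiny hand-made 4-dimensional vectors as stand-ins for real model output (production models emit hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for what a model would emit for each text.
v_cat    = [0.9, 0.1, 0.0, 0.2]  # "a small cat"
v_kitten = [0.8, 0.2, 0.1, 0.3]  # "a young kitten"
v_stock  = [0.0, 0.9, 0.8, 0.1]  # "quarterly stock report"

# Related texts score higher than unrelated ones.
assert cosine_similarity(v_cat, v_kitten) > cosine_similarity(v_cat, v_stock)
```

The same function works unchanged on real 1536-dimensional vectors; only the inputs change.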

The embedding model decision axes:
  • Quality — How well does it capture semantic similarity for your content?
  • Dimensions — How many numbers in each vector? More dimensions mean more storage and more compute per similarity comparison
  • Deployment — API call vs. self-hosted
  • Cost — Per-token pricing for API models; compute cost for self-hosted
  • Multilingual — Does it support your languages?
  • Domain — General-purpose or domain-optimised?
---

Step 1 — Define the embedding use case

Ask: content type, query pattern, languages, domain, expected volume, and downstream task.
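The Step 1 answers can be captured as a small structured record so the later steps have something concrete to evaluate against. A sketch with illustrative field names and values:

```python
from dataclasses import dataclass

@dataclass
class EmbeddingUseCase:
    """Answers to the Step 1 questions; fields mirror the list above."""
    content_type: str            # e.g. "support articles", "legal contracts"
    query_pattern: str           # e.g. "short natural-language questions"
    languages: list[str]
    domain: str                  # "general" or a specialty such as "biomedical"
    expected_volume_tokens: int  # tokens to embed over the planning horizon
    downstream_task: str         # "RAG retrieval", "dedup", "clustering", ...

# Example use case (all values illustrative).
use_case = EmbeddingUseCase(
    content_type="product documentation",
    query_pattern="short how-to questions",
    languages=["en", "de"],
    domain="general",
    expected_volume_tokens=50_000_000,
    downstream_task="RAG retrieval",
)
```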

Step 2 — Evaluate candidate models

API-based:
  • OpenAI text-embedding-3-small — 1536 dimensions, $0.02 per 1M tokens
  • OpenAI text-embedding-3-large — 3072 dimensions, $0.13 per 1M tokens
  • Cohere Embed v3 — 1024 dimensions, best multilingual among the API options

Open-source:
  • nomic-embed-text-v1.5 — 768 dimensions, long context, free
  • BAAI/bge-m3 — 1024 dimensions, 100+ languages
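The per-token prices above make the 12-month cost estimate (required by the quality check) a short calculation. A sketch, assuming an illustrative initial index of 200M tokens plus 10M new or re-embedded tokens per month:

```python
# Per-1M-token prices for the API models listed above.
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def annual_cost(model: str, initial_tokens: int, monthly_tokens: int) -> float:
    """12-month embedding spend: initial index build plus monthly churn."""
    total_tokens = initial_tokens + 12 * monthly_tokens
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

for model in PRICE_PER_1M:
    cost = annual_cost(model, initial_tokens=200_000_000, monthly_tokens=10_000_000)
    print(f"{model}: ${cost:,.2f} over 12 months")
```

At these assumed volumes (320M tokens total), the gap between the two OpenAI models is a factor of 6.5, which is why the quality-vs-cost tradeoff in Step 3 needs an explicit answer.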

Step 3 — Define the model selection criteria and decision

Primary factors: multilingual requirement, data sovereignty, volume cost, context length needs, and the quality vs. cost tradeoff.
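One way to make the decision explicit and auditable is a weighted decision matrix over the primary factors. A sketch — the weights and 1-to-5 scores below are illustrative placeholders, not recommendations:

```python
# Illustrative weights over the Step 3 factors (must sum to 1.0).
WEIGHTS = {
    "quality": 0.30,
    "multilingual": 0.25,
    "cost": 0.20,
    "context_length": 0.15,
    "sovereignty": 0.10,
}

# Illustrative 1-5 scores per candidate; score these for your own use case.
candidates = {
    "text-embedding-3-small": {"quality": 4, "multilingual": 3, "cost": 5,
                               "context_length": 4, "sovereignty": 2},
    "bge-m3":                 {"quality": 4, "multilingual": 5, "cost": 4,
                               "context_length": 5, "sovereignty": 5},
}

def weighted_score(scores: dict[str, int]) -> float:
    return sum(WEIGHTS[factor] * score for factor, score in scores.items())

ranked = sorted(candidates, key=lambda m: weighted_score(candidates[m]), reverse=True)
print("Selected:", ranked[0])
```

Recording the weights alongside the decision means the choice can be revisited when requirements (e.g. a new language) change.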

Step 4 — Write the embedding model specification

The specification should include:
  • Model name and version (pinned)
  • Input preprocessing
  • Batching configuration
  • Embedding storage schema
  • Similarity metric and thresholds
  • Refresh policy
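A specification with those fields can live as a checked-in config object. A sketch — every value below is an illustrative placeholder, not a recommended setting:

```python
# Embedding model specification, pinned and versioned in the repo.
EMBEDDING_SPEC = {
    "model": "text-embedding-3-small",   # pinned by exact name, not provider
    "model_version": "2024-01-25",       # provider snapshot date, if published
    "dimensions": 1536,                  # must match the vector DB index config
    "preprocessing": {
        "strip_html": True,
        "max_chunk_tokens": 512,
        "chunk_overlap_tokens": 64,
    },
    "batching": {"max_batch_size": 128, "max_retries": 3},
    "storage": {"table": "doc_embeddings", "vector_column": "embedding"},
    "similarity": {"metric": "cosine", "min_score": 0.75},
    "refresh_policy": "re-embed on content edit; full rebuild on model change",
}
```

Keeping `dimensions` in the spec makes the vector-DB mismatch (a common silent failure) checkable at deploy time.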

Quality check before delivering

  • Model is selected by name and version — not just provider
  • Multilingual requirement is answered explicitly
  • 12-month volume cost is calculated
  • Dimension size is recorded (must match vector DB config)
  • Re-embed trigger policy is defined
  • Similarity threshold is defined

Suggested next step: Before building the index, run a small-scale quality evaluation. Take 20 representative content items and 20 queries. Embed both, compute similarity, and manually judge relevance. Catch model-fit issues before you've built the vector database.
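The evaluation loop can be sketched in a few lines. For the sketch to be self-contained, `embed()` below is a trivial bag-of-words stand-in — in the real evaluation, replace it with a call to the candidate model, and use ~20 items and ~20 queries rather than the handful shown:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def embed(text: str) -> Counter:
    # Stand-in for the candidate embedding model: a bag-of-words vector.
    # Swap in the real model's API call when running the evaluation.
    return Counter(text.lower().split())

docs = {  # ~20 representative content items in practice; 3 shown
    "doc1": "how to reset your password",
    "doc2": "billing and invoice history",
    "doc3": "exporting data to csv",
}
queries = ["i forgot my password", "find my invoice history"]

for q in queries:
    qv = embed(q)
    best = max(docs, key=lambda d: cosine(qv, embed(docs[d])))
    print(f"{q!r} -> {best}")  # manually judge each hit for relevance
```

If the manually judged hit rate is poor here, the model (or the chunking) is wrong for the content, and it is far cheaper to learn that now than after the full index is built.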