Embeddings explained: how they work and which to use in 2026
A practical guide to embeddings for AI builders. Covers how embeddings work, the best models in 2026, and working Python code for generating and searching embeddings.
Embeddings are the foundation of modern AI retrieval systems. If you are building search, RAG, recommendations, clustering, or semantic matching, embeddings are not optional. This guide is practical and implementation-first: what embeddings are, which models matter in 2026, and how to build a working pipeline in Python.
1. What are embeddings and why every AI builder needs to understand them
An embedding is a numeric representation of content (usually text, sometimes image/code/audio) where similar meaning is placed closer together in vector space.
Why this matters:
- Keyword search misses semantic meaning. Embeddings recover intent.
- RAG quality depends heavily on embedding quality.
- Recommendation systems and dedup pipelines use vector similarity.
- Classifiers can be built faster using embedding features.
Concrete builder problems solved by embeddings:
- “Reset password” vs “can’t sign in” should match.
- Support tickets should cluster by issue type.
- Docs search should return conceptually relevant answers, not only exact term matches.
If you are building any knowledge workflow, understanding embeddings gives you direct control over recall, precision, latency, and cost.
2. How embeddings work: vectors, dimensions, and similarity explained simply
A model maps text to a vector like:
```
"How do I reset my password?" -> [0.018, -0.442, 0.091, ...]
```
That vector has N dimensions. Larger dimensions can capture richer signals, but they increase storage and retrieval cost.
Core concepts:
- Distance / similarity: closer vectors mean more semantic similarity.
- Cosine similarity: compares direction, widely used for text embeddings.
- Dot product: popular in ANN systems with normalized vectors.
- Euclidean distance: used in some pipelines, less common for normalized text embeddings.
Practical rule: normalize vectors and use cosine or dot product unless your model docs recommend otherwise.
Small intuition example:
- “refund policy” and “return policy” vectors should be close.
- “GPU kernel panic” should be far from “invoice payment terms.”
If this behavior is not happening, your model choice, chunking, or preprocessing is likely wrong.
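The similarity math above can be sketched in plain NumPy. The vectors here are toy 3-dimensional stand-ins for real embeddings, chosen only to illustrate the intuition:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity compares direction, not magnitude.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings.
refund = [0.9, 0.1, 0.2]    # "refund policy"
returns = [0.85, 0.15, 0.25]  # "return policy" -- semantically close
gpu = [0.05, 0.9, -0.4]     # "GPU kernel panic" -- unrelated topic

print(cosine_similarity(refund, returns))  # close to 1.0
print(cosine_similarity(refund, gpu))      # much lower
```

Note that once vectors are L2-normalized, cosine similarity and dot product give the same ranking, which is why ANN systems often use dot product on normalized vectors.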
3. Text embeddings: the models that matter in 2026
In 2026, most builders use a mix of hosted APIs and open models:
- OpenAI `text-embedding-3-small` for cost-sensitive high-volume retrieval.
- OpenAI `text-embedding-3-large` for stronger multilingual and hard semantic tasks.
- Google `gemini-embedding-001` as a strong general embedding option in Google ecosystems.
- Google `text-embedding-005` and `text-multilingual-embedding-002` in existing Vertex pipelines.
- Cohere `embed-v4.0` for text/image/mixed pipelines, plus `embed-english-v3.0` and `embed-multilingual-v3.0` in existing stacks.
- Open-source options such as `BAAI/bge-m3`, `intfloat/e5-large-v2`, and `jinaai/jina-embeddings-v3` for self-hosted control.
You do not need every model. Pick two candidates, run retrieval evals on your own data, and choose by quality-per-dollar.
4. Embedding models compared: OpenAI, Cohere, Google, open-source options
Pricing and capabilities change, so verify before committing to production. The table below uses officially published information where available and clearly marks what must be re-checked.
| Provider | Model | Strengths | Typical dimensions | Pricing signal |
|---|---|---|---|---|
| OpenAI | text-embedding-3-small | Strong quality at very low cost for many RAG use cases | 1536 default | Published: about $0.02 / 1M input tokens (verify current pricing page) |
| OpenAI | text-embedding-3-large | Better quality for harder retrieval and multilingual tasks | up to 3072 | Published: about $0.13 / 1M input tokens (verify current pricing page) |
| Google | gemini-embedding-001 | High-quality unified text/code/multilingual embedding in Google stack | up to 3072 | Gemini API published paid tier around $0.15 / 1M input tokens, batch lower (verify) |
| Google | text-embedding-005 | English/code specialized compatibility model | up to 768 | Check Vertex/Gemini pricing docs |
| Cohere | embed-v4.0 | Multimodal support (text + image + mixed docs), long context | configurable, default 1536 | Check Cohere pricing page for current embed rates |
| Cohere | embed-english-v3.0 / embed-multilingual-v3.0 | Mature v3 family used in many existing deployments | 1024 | Check Cohere pricing page |
| Open-source | bge-m3, e5-large-v2, jina-embeddings-v3 | Full control, self-hosting, no per-token API fee | varies | Infra cost only (GPU/CPU + ops) |
Honest performance guidance:
- API models usually win on convenience and baseline quality consistency.
- Open-source models can be very competitive when tuned and served well.
- Most teams underperform because of weak chunking and evaluation, not because they picked the “wrong” top-tier model.
5. How to choose the right embedding model for your use case
Use this decision flow:
- Need fastest time-to-production and minimal infra? Start with hosted API.
- Need strict data residency, custom infra, or lower marginal cost at high volume? Evaluate open-source self-hosted.
- Need multilingual + code + long-context coverage in one model? Favor unified modern embedding models.
- Need multimodal document retrieval? Choose a model with explicit text+image support.
Selection criteria that actually matter:
- Retrieval quality on your corpus: Recall@k, MRR, nDCG.
- End-to-end latency: embed + search + rerank.
- Cost per successful answer, not cost per token alone.
- Operational fit: SDK, quotas, monitoring, incident workflow.
Practical advice:
- Benchmark 2-3 models only.
- Use the same chunking and the same vector DB during comparison.
- Evaluate on real production-like queries, not synthetic toy prompts.
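The retrieval metrics named above (Recall@k, MRR) are simple to compute yourself. A minimal sketch with illustrative data, not tied to any specific eval library:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant docs that appear in the top-k results.
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    # Reciprocal rank of the first relevant result (0 if none found).
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Illustrative: one query's ranked retrieval output and its ground truth.
ranked = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4"}
print(recall_at_k(ranked, relevant, k=2))  # 0.5 (only d2 is in the top 2)
print(mrr(ranked, relevant))               # 0.5 (first hit at rank 2)
```

Average these per-query scores over your benchmark set to compare two embedding models on identical data.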
6. Generating embeddings in Python: working code examples
OpenAI API example
```
pip install -U openai numpy
```

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

texts = [
    "How do I reset my password?",
    "I cannot sign in to my account.",
    "What is your refund policy?",
]

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

vectors = [row.embedding for row in resp.data]
print("num_vectors:", len(vectors), "dim:", len(vectors[0]))
print("dtype:", type(vectors[0][0]))
print("avg_norm:", float(np.mean([np.linalg.norm(v) for v in vectors])))
```

Open-source local example

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
texts = ["password reset issue", "invoice payment terms"]
vectors = model.encode(texts, normalize_embeddings=True)
print("num_vectors:", len(vectors))
print("dim:", len(vectors[0]))
print("first_5:", vectors[0][:5])
```

Both snippets are production-usable starting points. Add retries, monitoring, and batching limits for real traffic.
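A sketch of what those batching and retry limits can look like, assuming the OpenAI client shown earlier. The batch size and backoff policy are illustrative; tune them for your own quotas:

```python
import time

def embed_in_batches(client, texts, model="text-embedding-3-small",
                     batch_size=100, max_retries=3):
    """Embed texts in fixed-size batches with simple exponential backoff.

    Batch size and retry policy here are illustrative defaults, not
    provider-recommended values.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                resp = client.embeddings.create(model=model, input=batch)
                vectors.extend(row.embedding for row in resp.data)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # surface the error after the final attempt
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return vectors
```

For real traffic you would also log failures per batch and cap total request concurrency, but this shape keeps one oversized request from taking down a whole ingestion run.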
7. Storing and searching embeddings: vector databases explained
After creating vectors, you store them in a vector index and query by similarity.
Popular choices:
- Pinecone: managed, fast to production.
- Weaviate: flexible schema + hybrid search.
- pgvector: best when your stack is PostgreSQL-first.
- Chroma: simple local and MVP workflows.
Minimal local search example with FAISS
```
pip install -U faiss-cpu numpy
```

```python
import faiss
import numpy as np

# Example normalized vectors (replace with real embeddings)
docs = np.array([
    [0.10, 0.90, 0.10],
    [0.80, 0.10, 0.10],
    [0.12, 0.86, 0.15],
], dtype="float32")

query = np.array([[0.11, 0.88, 0.11]], dtype="float32")

index = faiss.IndexFlatIP(docs.shape[1])  # dot-product search
index.add(docs)

scores, ids = index.search(query, k=2)
print("top_ids:", ids[0].tolist())
print("scores:", scores[0].tolist())
```

For large-scale systems, move to ANN indexes and add metadata filtering, reranking, and tenant isolation.
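The tenant-isolation point deserves a concrete shape. One common pattern is pre-filtering candidates by metadata before scoring, so results can never leak across tenant boundaries. A minimal NumPy sketch with an illustrative metadata schema (real vector DBs expose this as query-time filters):

```python
import numpy as np

# Illustrative corpus: vectors plus per-document metadata.
doc_vecs = np.array([
    [0.10, 0.90, 0.10],
    [0.80, 0.10, 0.10],
    [0.12, 0.86, 0.15],
], dtype="float32")
doc_meta = [
    {"id": "a", "tenant": "acme"},
    {"id": "b", "tenant": "beta"},
    {"id": "c", "tenant": "acme"},
]

def search(query_vec, tenant, k=2):
    # Pre-filter by tenant, then rank survivors by dot-product score.
    idx = [i for i, m in enumerate(doc_meta) if m["tenant"] == tenant]
    scores = doc_vecs[idx] @ query_vec
    order = np.argsort(-scores)[:k]
    return [(doc_meta[idx[i]]["id"], float(scores[i])) for i in order]

query = np.array([0.11, 0.88, 0.11], dtype="float32")
print(search(query, tenant="acme"))  # only acme documents are scored
```

Pre-filtering is exact but can shrink the candidate pool; post-filtering after ANN search is the usual alternative when filters are highly selective.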
8. Embedding pitfalls: what goes wrong and how to fix it
Common failure modes:
- Bad chunking: vectors represent mixed topics and retrieval becomes noisy.
- Wrong metric: cosine vs dot mismatch hurts ranking.
- No normalization: similarity becomes unstable.
- Domain mismatch: generic embedding model on highly specialized text.
- No evaluation loop: quality regresses silently.
- Metadata ignored: results leak across tenant/product boundaries.
Fix strategy:
- Start with clean chunking (~500-1000 tokens + overlap, depending on your docs).
- Normalize vectors if your model/index expects it.
- Add reranker for top-k candidate refinement.
- Run weekly retrieval regression tests.
- Track bad queries and feed them into evaluation sets.
If retrieval quality is poor, fix data and indexing before replacing your whole model stack.
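The chunking fix above can be sketched as a sliding window with overlap. This version approximates tokens with whitespace words, which is an assumption; swap in a real tokenizer for production:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks.

    Uses whitespace-separated words as a rough stand-in for tokens;
    replace with a real tokenizer count for production use.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail
    return chunks

doc = "word " * 1200  # 1200-word toy document
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print(len(chunks))             # 3 chunks
print(len(chunks[0].split()))  # 500 words in the first chunk
```

The overlap ensures a sentence split at a chunk boundary still appears whole in at least one chunk, which directly improves recall on boundary-spanning queries.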
9. Multimodal embeddings: text, images, and code
Multimodal embeddings place different data types in a shared vector space. This enables:
- Text-to-image retrieval.
- Screenshot-to-doc matching.
- Cross-modal recommendations.
Code embeddings are also increasingly important:
- Natural language query -> code snippet retrieval.
- Similar function detection.
- Security/code review triage workflows.
Practical guidance:
- Use specialized multimodal embedding models for image-heavy pipelines.
- Keep separate indexes for different modalities unless your model is truly joint-space by design.
- Evaluate each modality with its own ground-truth set.
Multimodal search is powerful, but operational complexity rises quickly. Start narrow, prove value, then expand.
10. What to learn next
After mastering embeddings, focus on retrieval evaluation, reranking, and hybrid search. Build a benchmark dataset from real production queries and track Recall@k and answer success rate over time. Learn ANN indexing tradeoffs, metadata filtering, and semantic caching. If you are building RAG, combine strong embeddings with prompt design and output validation. The best systems are not built from one “best model,” but from disciplined evaluation and iteration.
Suggested next steps:
- Build a 100-query retrieval benchmark.
- Compare two embedding models on identical data.
- Add reranking and measure gain.
- Add cost dashboards per feature.
Next article
Fine-tuning LLMs: complete guide to LoRA, QLoRA, and when to fine-tune (2026)
A practical guide to fine-tuning large language models in 2026. Covers LoRA, QLoRA, dataset creation, and an honest framework for when fine-tuning beats RAG.