Last verified 2026-03-21

Embeddings explained: how they work and which to use in 2026

A practical guide to embeddings for AI builders. Covers how embeddings work, the best models in 2026, and working Python code for generating and searching embeddings.

By Knovo Team · 2026-03-21 · 13 min read

Embeddings are the foundation of modern AI retrieval systems. If you are building search, RAG, recommendations, clustering, or semantic matching, embeddings are not optional. This guide is practical and implementation-first: what embeddings are, which models matter in 2026, and how to build a working pipeline in Python.

1. What are embeddings and why every AI builder needs to understand them

An embedding is a numeric representation of content (usually text, sometimes images, code, or audio) in which items with similar meaning sit closer together in vector space.

Why this matters:

  1. Keyword search misses semantic meaning. Embeddings recover intent.
  2. RAG quality depends heavily on embedding quality.
  3. Recommendation systems and dedup pipelines use vector similarity.
  4. Classifiers can be built faster using embedding features.

Concrete builder problems solved by embeddings:

  1. “Reset password” vs “can’t sign in” should match.
  2. Support tickets should cluster by issue type.
  3. Docs search should return conceptually relevant answers, not only exact term matches.

If you are building any knowledge workflow, understanding embeddings gives you direct control over recall, precision, latency, and cost.

2. How embeddings work: vectors, dimensions, and similarity explained simply

A model maps text to a vector like:

"How do I reset my password?" -> [0.018, -0.442, 0.091, ...]

That vector has N dimensions. Higher-dimensional vectors can capture richer signal, but they increase storage and retrieval cost.

Core concepts:

  1. Distance / similarity: closer vectors mean more semantic similarity.
  2. Cosine similarity: compares direction, widely used for text embeddings.
  3. Dot product: popular in ANN systems with normalized vectors.
  4. Euclidean distance: used in some pipelines, less common for normalized text embeddings.

Practical rule: normalize vectors and use cosine or dot product unless your model docs recommend otherwise.
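To make that rule concrete, here is a minimal NumPy sketch of cosine similarity as a normalized dot product. The vectors are toy values standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity = dot product of L2-normalized vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

v1 = np.array([0.20, 0.90, 0.10])   # e.g. "refund policy"
v2 = np.array([0.25, 0.85, 0.15])   # e.g. "return policy" (similar meaning)
v3 = np.array([-0.80, 0.10, 0.50])  # e.g. unrelated text

print(cosine_similarity(v1, v2))  # close to 1.0
print(cosine_similarity(v1, v3))  # much lower
```

If you pre-normalize every vector once at indexing time, a plain dot product at query time gives the same ranking as cosine, which is why many ANN systems prefer it.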

Small intuition example:

  1. “refund policy” and “return policy” vectors should be close.
  2. “GPU kernel panic” should be far from “invoice payment terms.”

If this behavior is not happening, your model choice, chunking, or preprocessing is likely wrong.

3. Text embeddings: the models that matter in 2026

In 2026, most builders use a mix of hosted APIs and open models:

  1. OpenAI text-embedding-3-small for cost-sensitive high-volume retrieval.
  2. OpenAI text-embedding-3-large for stronger multilingual and hard semantic tasks.
  3. Google gemini-embedding-001 as a strong general embedding option in Google ecosystems.
  4. Google text-embedding-005 and text-multilingual-embedding-002 in existing Vertex pipelines.
  5. Cohere embed-v4.0 for text/image/mixed pipelines, plus embed-english-v3.0 and embed-multilingual-v3.0 in existing stacks.
  6. Open-source options such as BAAI/bge-m3, intfloat/e5-large-v2, and jinaai/jina-embeddings-v3 for self-hosted control.

You do not need every model. Pick two candidates, run retrieval evals on your own data, and choose by quality-per-dollar.

4. Embedding models compared: OpenAI, Cohere, Google, open-source options

Pricing and capabilities change, so verify before committing to production. The table below uses officially published information where available and clearly marks what must be re-checked.

| Provider | Model | Strengths | Typical dimensions | Pricing signal |
|---|---|---|---|---|
| OpenAI | text-embedding-3-small | Strong quality at very low cost for many RAG use cases | 1536 default | Published: about $0.02 / 1M input tokens (verify current pricing page) |
| OpenAI | text-embedding-3-large | Better quality for harder retrieval and multilingual tasks | up to 3072 | Published: about $0.13 / 1M input tokens (verify current pricing page) |
| Google | gemini-embedding-001 | High-quality unified text/code/multilingual embedding in Google stack | up to 3072 | Gemini API published paid tier around $0.15 / 1M input tokens, batch lower (verify) |
| Google | text-embedding-005 | English/code specialized compatibility model | up to 768 | Check Vertex/Gemini pricing docs |
| Cohere | embed-v4.0 | Multimodal support (text + image + mixed docs), long context | configurable, default 1536 | Check Cohere pricing page for current embed rates |
| Cohere | embed-english-v3.0 / embed-multilingual-v3.0 | Mature v3 family used in many existing deployments | 1024 | Check Cohere pricing page |
| Open-source | bge-m3, e5-large-v2, jina-embeddings-v3 | Full control, self-hosting, no per-token API fee | varies | Infra cost only (GPU/CPU + ops) |

Honest performance guidance:

  1. API models usually win on convenience and baseline quality consistency.
  2. Open-source models can be very competitive when tuned and served well.
  3. Most teams underperform because of weak chunking and evaluation, not because they picked the “wrong” top-tier model.

5. How to choose the right embedding model for your use case

Use this decision flow:

  1. Need fastest time-to-production and minimal infra? Start with hosted API.
  2. Need strict data residency, custom infra, or lower marginal cost at high volume? Evaluate open-source self-hosted.
  3. Need multilingual + code + long-context coverage in one model? Favor unified modern embedding models.
  4. Need multimodal document retrieval? Choose a model with explicit text+image support.

Selection criteria that actually matter:

  1. Retrieval quality on your corpus: Recall@k, MRR, nDCG.
  2. End-to-end latency: embed + search + rerank.
  3. Cost per successful answer, not cost per token alone.
  4. Operational fit: SDK, quotas, monitoring, incident workflow.

Practical advice:

  1. Benchmark 2-3 models only.
  2. Use same chunking and same vector DB during comparison.
  3. Evaluate on real production-like queries, not synthetic toy prompts.
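As a sketch of the first selection criterion, Recall@k and MRR can be computed in a few lines. The ranked results and relevance sets below are illustrative, standing in for real retriever output and a labeled query set:

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries whose top-k results contain at least one relevant doc."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)

def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant result per query (0 if none found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

# Toy run: two queries, ranked doc IDs from some retriever
results = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1"}, {"d5"}]

print(recall_at_k(results, relevant, k=3))  # 0.5: only the first query hits
print(mrr(results, relevant))               # 0.25: rank 2 for query 1, miss for query 2
```

Run the same metrics over both candidate models on identical chunks and queries, and the comparison becomes a single number per model rather than a vibe check.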

6. Generating embeddings in Python: working code examples

OpenAI API example

```shell
pip install -U openai numpy
```

```python
from openai import OpenAI
import numpy as np

# Assumes OPENAI_API_KEY is set in the environment
client = OpenAI()

texts = [
    "How do I reset my password?",
    "I cannot sign in to my account.",
    "What is your refund policy?",
]

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

vectors = [row.embedding for row in resp.data]
print("num_vectors:", len(vectors), "dim:", len(vectors[0]))
print("dtype:", type(vectors[0][0]))
print("avg_norm:", float(np.mean([np.linalg.norm(v) for v in vectors])))
```

Open-source local example

```shell
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

# Downloads model weights on first run
model = SentenceTransformer("BAAI/bge-m3")
texts = ["password reset issue", "invoice payment terms"]
vectors = model.encode(texts, normalize_embeddings=True)

print("num_vectors:", len(vectors))
print("dim:", len(vectors[0]))
print("first_5:", vectors[0][:5])
```

Both snippets are production-usable starting points. Add retries, monitoring, and batching limits for real traffic.
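One way to add the batching and retries mentioned above is a small wrapper around whatever embedding call you use. Here `embed_fn`, the batch size, and the backoff schedule are illustrative choices, not a fixed API:

```python
import time

def embed_batched(texts, embed_fn, batch_size=100, max_retries=3):
    """Call embed_fn (list[str] -> list of vectors) in batches,
    retrying each batch with exponential backoff on failure."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
    return vectors

# Usage with the OpenAI client from the snippet above (sketch):
# embed_fn = lambda batch: [r.embedding for r in client.embeddings.create(
#     model="text-embedding-3-small", input=batch).data]
# vectors = embed_batched(all_texts, embed_fn, batch_size=100)
```

Keeping the provider call behind a function also makes it easy to swap models during the two-candidate benchmark recommended earlier.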

7. Storing and searching embeddings: vector databases explained

After creating vectors, you store them in a vector index and query by similarity.

Popular choices:

  1. Pinecone: managed, fast to production.
  2. Weaviate: flexible schema + hybrid search.
  3. pgvector: best when your stack is PostgreSQL-first.
  4. Chroma: simple local and MVP workflows.

Minimal local search example with FAISS

```shell
pip install -U faiss-cpu numpy
```

```python
import faiss
import numpy as np

# Toy 3-dimensional vectors (replace with real embeddings)
docs = np.array([
    [0.10, 0.90, 0.10],
    [0.80, 0.10, 0.10],
    [0.12, 0.86, 0.15],
], dtype="float32")

query = np.array([[0.11, 0.88, 0.11]], dtype="float32")

# Normalize so inner-product search is equivalent to cosine similarity
faiss.normalize_L2(docs)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(docs.shape[1])  # inner-product (dot-product) search
index.add(docs)

scores, ids = index.search(query, k=2)
print("top_ids:", ids[0].tolist())
print("scores:", scores[0].tolist())
```

For large-scale systems, move to ANN indexes and add metadata filtering, reranking, and tenant isolation.

8. Embedding pitfalls: what goes wrong and how to fix it

Common failure modes:

  1. Bad chunking: vectors represent mixed topics and retrieval becomes noisy.
  2. Wrong metric: cosine vs dot mismatch hurts ranking.
  3. No normalization: similarity becomes unstable.
  4. Domain mismatch: generic embedding model on highly specialized text.
  5. No evaluation loop: quality regresses silently.
  6. Metadata ignored: results leak across tenant/product boundaries.

Fix strategy:

  1. Start with clean chunking (roughly 500-1000 tokens per chunk with overlap, depending on your documents).
  2. Normalize vectors if your model/index expects it.
  3. Add reranker for top-k candidate refinement.
  4. Run weekly retrieval regression tests.
  5. Track bad queries and feed them into evaluation sets.
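The chunking advice in step 1 can be sketched as a simple overlapping window. This version counts whitespace-separated words for brevity; a real pipeline should count model tokens with the model's tokenizer, and the sizes here are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of whitespace tokens.

    Each chunk is chunk_size words; consecutive chunks share `overlap`
    words so no sentence is stranded at a hard boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(doc, chunk_size=200, overlap=40)
print("num_chunks:", len(chunks))
```

Splitting on semantic boundaries (headings, paragraphs) usually beats fixed windows; treat this as the baseline to improve on, not the end state.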

If retrieval quality is poor, fix data and indexing before replacing your whole model stack.

9. Multimodal embeddings: text, images, and code

Multimodal embeddings place different data types in a shared vector space. This enables:

  1. Text-to-image retrieval.
  2. Screenshot-to-doc matching.
  3. Cross-modal recommendations.

Code embeddings are also increasingly important:

  1. Natural language query -> code snippet retrieval.
  2. Similar function detection.
  3. Security/code review triage workflows.

Practical guidance:

  1. Use specialized multimodal embedding models for image-heavy pipelines.
  2. Keep separate indexes for different modalities unless your model is truly joint-space by design.
  3. Evaluate each modality with its own ground-truth set.

Multimodal search is powerful, but operational complexity rises quickly. Start narrow, prove value, then expand.

10. What to learn next

After mastering embeddings, focus on retrieval evaluation, reranking, and hybrid search. Build a benchmark dataset from real production queries and track Recall@k and answer success rate over time. Learn ANN indexing tradeoffs, metadata filtering, and semantic caching. If you are building RAG, combine strong embeddings with prompt design and output validation. The best systems are not built from one “best model,” but from disciplined evaluation and iteration.

Suggested next steps:

  1. Build a 100-query retrieval benchmark.
  2. Compare two embedding models on identical data.
  3. Add reranking and measure gain.
  4. Add cost dashboards per feature.

Next article

Fine-tuning LLMs: complete guide to LoRA, QLoRA, and when to fine-tune (2026)

A practical guide to fine-tuning large language models in 2026. Covers LoRA, QLoRA, dataset creation, and an honest framework for when fine-tuning beats RAG.