Embedding Models in Production: Choosing the Right One for Your App

A practical guide to picking an embedding model for production: MTEB benchmarks, head-to-head comparisons of BGE, Nomic, E5, Cohere, and OpenAI models, multilingual considerations, and code to get started.
Author: General Compute
Published: 2026-06-25
Tags: embedding model, rag, vector search, sentence transformers, mteb, production ai, python


Choosing an embedding model feels like a small decision, but it shapes retrieval quality, latency, cost, and how well your application handles edge cases. Swapping a weak model for a strong one can noticeably improve your RAG pipeline's answer quality -- without changing the chunking strategy, vector database, or LLM.

This guide covers the MTEB leaderboard and how to read it, the main open-source and managed options (BGE, Nomic, E5, Cohere, OpenAI), what multilingual support actually looks like in practice, and a decision framework to help you pick the right model for what you're building.

## What Embedding Models Actually Do

An embedding model converts text into a dense vector -- a fixed-length list of floats. Texts with similar meanings end up with vectors that are close together in that high-dimensional space. When you search a vector database, you're asking: "which stored vectors are closest to the query vector?"
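
As a rough illustration of that "closest vector" question, here is a minimal brute-force sketch using cosine similarity in NumPy. The function name `top_k` and the toy 3-dimensional vectors are made up for this example; a real vector database uses approximate indexes rather than a full scan.

```python
import numpy as np

def top_k(query_vec, stored_vecs, k=3):
    """Return indices and cosine scores of the k stored vectors closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    s = stored_vecs / np.linalg.norm(stored_vecs, axis=1, keepdims=True)
    scores = s @ q                      # cosine similarity of every stored vector to the query
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions
stored = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
])
query = np.array([1.0, 0.0, 0.0])

indices, scores = top_k(query, stored, k=2)
print(indices.tolist())  # [0, 2]
```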

The quality of embeddings determines whether semantically related passages actually cluster together. A weak embedding model might place "myocardial infarction" far from "heart attack" because it hasn't learned the semantic overlap. A strong one puts them close.

Three properties matter most in production:

- **Retrieval quality**: how well the model ranks relevant documents above irrelevant ones
- **Latency**: how fast it can embed a batch of texts
- **Dimension count**: higher dimensions usually mean better quality but more storage and slower similarity search

Most models offer a fixed set of these trade-offs. Understanding them helps you stop defaulting to whatever was in the tutorial you read last month.

## The MTEB Leaderboard

[MTEB](https://huggingface.co/spaces/mteb/leaderboard) (Massive Text Embedding Benchmark) is the standard benchmark for embedding models. It covers 58 datasets across 8 task categories: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), summarization, and bitext mining.

For most RAG use cases, retrieval and reranking scores are what you care about. The leaderboard shows an average across all tasks, which can obscure a model that is excellent at retrieval but weak at clustering (or vice versa).

Practical things to keep in mind when reading MTEB numbers:

**Higher average does not always mean better for your task.** A model tuned for sentence similarity might score well overall but be mediocre at document retrieval. Filter the leaderboard by the task type that matches your use case.

**English vs. multilingual.** MTEB has separate leaderboards for multilingual models (MMTEB). The top English models often underperform on non-English text. If you serve multiple languages, check the multilingual scores specifically.

**Model size and inference time are not shown.** A model scoring 65 average with 128M parameters is very different from one scoring 66 with 7B parameters. You have to check those separately.

**Dataset contamination is a real concern.** Some models are trained on data that overlaps with MTEB test sets. Take leaderboard positions with some skepticism; try to find models with demonstrated performance on data similar to yours.
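
To make the first point concrete, the sketch below ranks two made-up models by overall average and by retrieval score alone. The model names and numbers are purely hypothetical and are not real MTEB results; the only purpose is to show how the two rankings can disagree.

```python
# Hypothetical per-category scores for two made-up models (NOT real MTEB results)
scores = {
    "model_a": {"retrieval": 52.0, "sts": 85.0, "clustering": 48.0, "classification": 78.0},
    "model_b": {"retrieval": 57.0, "sts": 80.0, "clustering": 44.0, "classification": 75.0},
}

def overall_average(task_scores):
    """Unweighted mean across task categories, like a leaderboard 'average' column."""
    return sum(task_scores.values()) / len(task_scores)

def rank_by(task, all_scores):
    """Model names sorted by a single task category, best first."""
    return sorted(all_scores, key=lambda m: all_scores[m][task], reverse=True)

best_overall = max(scores, key=lambda m: overall_average(scores[m]))
best_retrieval = rank_by("retrieval", scores)[0]

print(best_overall)    # model_a (65.75 vs 64.0 average)
print(best_retrieval)  # model_b (57.0 vs 52.0 on retrieval)
```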

## The Main Contenders

### BGE (BAAI/bge-*)

BGE models from BAAI (Beijing Academy of Artificial Intelligence) have been consistently strong on MTEB since their release. The family includes several variants worth knowing:

- **bge-large-en-v1.5** (1024 dims, 335M params): top-tier English retrieval, solid choice for English-only applications
- **bge-m3** (1024 dims, 568M params): multilingual, supports 100+ languages, also strong on English -- arguably the best general-purpose open-source choice in 2026
- **bge-small-en-v1.5** (384 dims, 33M params): compact, fast, reasonable quality for applications where latency is the primary constraint

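`bge-m3` supports three retrieval modes from a single model: dense retrieval, sparse retrieval (learned per-token weights that play a role similar to BM25), and multi-vector retrieval (ColBERT-style). It can produce all three representations in one pass, which is handy for hybrid search pipelines. The other BGE variants listed here are dense-only, and loading `bge-m3` through `sentence-transformers` gives you only the dense vector; the sparse and multi-vector outputs are exposed by BAAI's own `FlagEmbedding` library.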

Licensing: the BGE models are published under the MIT license -- safe for commercial use.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

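# bge-m3 needs no instruction prefix; the English v1.5 models recommend one for short queries
passages = [
    "Attention lets each token weigh every other token when building its representation.",
    "A transformer stacks attention and feed-forward layers.",
]
queries = ["explain attention mechanism"]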

# Encode passages and queries separately when doing retrieval
passage_embeddings = model.encode(passages, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)
```
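
Hybrid search needs some way to merge the dense and sparse result lists. One common option is reciprocal rank fusion (RRF); it is shown here only as an illustrative sketch and is not something BGE prescribes. The function name, the constant `k=60`, and the document IDs are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranked list.

    Each document scores 1 / (k + rank) in every list it appears in;
    the scores are summed and documents are sorted by the total.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical top-3 results from a dense and a sparse retriever
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF works on ranks rather than raw scores, so it avoids having to calibrate dense cosine similarities against sparse lexical scores.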

### Nomic Embed (nomic-ai/nomic-embed-text-v1.5)

Nomic's embed model gets less attention than BGE but deserves serious consideration for production use. Key attributes:

- 768 dimensions, 137M parameters
- Supports matryoshka representation learning (MRL) -- you can truncate to 256, 128, or 64 dimensions with a relatively small quality penalty, which cuts storage and search costs
- Strong MTEB retrieval scores that are competitive with models 2-3x larger
- Apache 2.0 license, fully open weights

The MRL feature is genuinely useful. If you're storing hundreds of millions of vectors, being able to use 256-dimensional embeddings instead of 768 with a 5-8% quality reduction can meaningfully reduce your vector database costs.
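
A minimal sketch of what MRL truncation looks like mechanically: keep the leading dimensions and re-normalize so cosine similarity still behaves. The helper name `truncate_embeddings` is hypothetical and the input here is random data standing in for real embeddings. Nomic's model card describes its own recommended procedure for this model, so treat this as the general idea rather than the exact recipe.

```python
import numpy as np

def truncate_embeddings(embeddings, dims):
    """Keep the first `dims` dimensions of each vector and re-normalize to unit length."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Random stand-in for four 768-dimensional embeddings
full = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)
small = truncate_embeddings(full, 256)

print(small.shape)  # (4, 256)
```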

```python
from sentence_transformers import SentenceTransformer

# The model uses task-specific prefixes
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def embed_documents(texts):
    return model.encode(
        [f"search_document: {t}" for t in texts],
        normalize_embeddings=True
    )

def embed_query(query):
    return model.encode(
        f"search_query: {query}",
        normalize_embeddings=True
    )
```

Note the task prefixes: `search_document:` for things you're indexing and `search_query:` for queries. Nomic trained the model with these prefixes, and skipping them degrades retrieval quality.

### E5 (intfloat/e5-*)

The E5 family from Microsoft covers a range of sizes. The variants most people use:

- **e5-large-v2** (1024 dims): strong English performance, good for applications that don't need multilingual support
- **multilingual-e5-large** (1024 dims): excellent multilingual retrieval, competitive with bge-m3 on many languages
- **e5-mistral-7b-instruct** (4096 dims): a 7B model fine-tuned from Mistral, currently one of the top-performing models on MTEB -- but it costs significantly more to run

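The 7B instruct variant is interesting but mostly useful when you have GPU resources available and query latency is not critical. Queries and documents have to be embedded by the same model, because vectors from different models are not comparable, so choosing it for indexing means running 7B-scale inference on every query as well. That is painful in a latency-sensitive path, which is why it fits batch workloads and low-QPS search better than interactive applications.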

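Like Nomic, E5 models expect task prefixes: `query: ` and `passage: ` for asymmetric retrieval tasks. The `e5-mistral-7b-instruct` variant uses an instruction-style query format instead; check its model card before reusing the snippet below with it.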

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

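# Placeholder inputs; substitute your own corpus and queries
document_texts = ["Paris is the capital of France.", "Berlin ist die Hauptstadt von Deutschland."]
query_texts = ["What is the capital of Germany?"]

# Prefixes are required for retrieval tasks
docs = [f"passage: {text}" for text in document_texts]
queries = [f"query: {q}" for q in query_texts]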

doc_embeddings = model.encode(docs, normalize_embeddings=True, batch_size=32)
query_embeddings = model.encode(queries, normalize_embeddings=True)
```

### Cohere Embed v3

Cohere's managed embedding API is the strongest commercial option for multilingual use cases. Embed v3 supports 100+ languages with consistent quality across all of them -- something that remains hard for open-source models to match at this scale.

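Dimensions: 1024 for `embed-multilingual-v3.0` and `embed-english-v3.0`; the `light` variants of each produce 384-dimensional vectors. Embed v3 does not offer MRL-style truncation, so the output size is fixed per model.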

The API requires you to specify an `input_type` parameter: `"search_document"`, `"search_query"`, `"classification"`, or `"clustering"`. This matters -- the model was trained with these modes and you'll see lower quality if you skip it.

```python
import cohere

co = cohere.Client("your-api-key")

def embed_documents(texts):
    response = co.embed(
        texts=texts,
        model="embed-multilingual-v3.0",
        input_type="search_document",
    )
    return response.embeddings

def embed_query(query):
    response = co.embed(
        texts=[query],
        model="embed-multilingual-v3.0",
        input_type="search_query",
    )
    return response.embeddings[0]
```

The downside is cost and vendor lock-in. At high volume, managed API pricing adds up relative to self-hosting bge-m3 on your own GPU.

### OpenAI text-embedding-3-small and text-embedding-3-large

OpenAI's third-generation embedding models are widely used because they work well and require no infrastructure, and if you're already using the OpenAI SDK for your LLM calls there is nothing new to integrate. They also support dimension reduction via MRL truncation using the `dimensions` parameter.

- **text-embedding-3-small** (1536 dims, or truncated): good quality at low cost, roughly $0.02 per million tokens
- **text-embedding-3-large** (3072 dims, or truncated): near-state-of-the-art on MTEB, $0.13 per million tokens

For English-only applications at moderate scale, `text-embedding-3-small` is hard to beat on the cost-quality curve. At very high scale, self-hosted open-source models become cheaper.

```python
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def embed_texts(texts, model="text-embedding-3-small", dimensions=1536):
    response = client.embeddings.create(
        input=texts,
        model=model,
        dimensions=dimensions,
    )
    return [item.embedding for item in response.data]
```

## Multilingual Options: What Actually Works

If your application needs to handle text in multiple languages, your embedding model needs multilingual training data. Simply using an English-focused model on French or Chinese text produces poor retrieval.

The options in order of quality-to-cost:

1. **bge-m3** -- best open-source multilingual option. Strong across European languages, Chinese, Japanese, Korean, Arabic. MIT license.
2. **multilingual-e5-large** -- competitive with bge-m3, especially for retrieval. MIT license.
3. **Cohere embed-multilingual-v3.0** -- consistently strong across 100+ languages, including lower-resource ones that bge-m3 and e5 may not cover well. Managed API.
4. **OpenAI text-embedding-3-large** -- solid multilingual performance, especially for languages with significant presence in web text.

One nuance: cross-lingual retrieval (query in English, documents in French) is harder than same-language retrieval. Cohere tends to handle this better than open-source models, but bge-m3 is competitive for the major language pairs.

If you're building for a single non-English language with a large user base (Chinese, German, Japanese), check whether there's a language-specific model that outperforms the multilingual options. Models like `Alibaba-NLP/gte-Qwen2-7B-instruct` often outperform general multilingual models for Chinese text.

## Comparison Table

| Model | Dims | Params | License | MTEB Avg | Best For |
|---|---|---|---|---|---|
| BAAI/bge-m3 | 1024 | 568M | MIT | 66.1 | Multilingual, hybrid search |
| nomic-embed-text-v1.5 | 768 | 137M | Apache 2.0 | 62.4 | Cost-effective production use |
| intfloat/multilingual-e5-large | 1024 | 560M | MIT | 64.6 | Multilingual retrieval |
| intfloat/e5-mistral-7b-instruct | 4096 | 7B | MIT | 66.6 | Highest quality, offline indexing |
| Cohere embed-multilingual-v3.0 | 1024 | N/A | Commercial | ~65 | Enterprise multilingual |
| OpenAI text-embedding-3-small | 1536 | N/A | Commercial | 62.3 | Low-cost managed option |
| OpenAI text-embedding-3-large | 3072 | N/A | Commercial | 64.6 | High-quality managed option |

MTEB averages vary depending on which tasks are included in the calculation. Treat these as rough guidance rather than precise rankings.

## How to Choose

A few questions that narrow the field quickly:

**Do you need multilingual support?** Use bge-m3, multilingual-e5-large, or Cohere. Rule out English-only models regardless of their English MTEB scores.

**Do you need to self-host?** Open-source models (BGE, Nomic, E5) can run on your own infrastructure. If you need air-gapped deployment or have data residency requirements, this is non-negotiable.

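**What's your throughput?** Queries and documents must be embedded by the same model, so the query path sets the ceiling on model size. If queries are infrequent or latency-tolerant, a larger model (e5-mistral-7b, bge-m3) is fine. For real-time query encoding at high QPS, you want something smaller -- bge-small, nomic-embed -- or a managed API that handles scaling for you.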

**How many vectors are you storing?** Higher-dimensional models use more storage. 100M vectors at 1024 dimensions (float32) is about 410 GB (100M × 1024 × 4 bytes) before any database overhead. At that scale, MRL truncation (Nomic to 256 dims, OpenAI to 512) can cut storage costs meaningfully with modest quality loss.

**What's your latency budget for indexing?** If you're re-embedding your corpus frequently, throughput matters. bge-small processes roughly 3x-5x more texts per second than bge-large on the same hardware.
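
The storage arithmetic from the vector-count question above is easy to reproduce for your own numbers. This is a back-of-the-envelope sketch; the helper name `index_size_gb` is made up, and it counts raw vector bytes only, ignoring index structures, metadata, and replication.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in decimal gigabytes (float32 by default)."""
    return num_vectors * dims * bytes_per_value / 1e9

full = index_size_gb(100_000_000, 1024)      # 1024-dim float32
truncated = index_size_gb(100_000_000, 256)  # same corpus truncated to 256 dims

print(full)       # 409.6
print(truncated)  # 102.4
```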

## A Practical Starting Point

For most RAG applications in English: start with `nomic-embed-text-v1.5` or `BAAI/bge-large-en-v1.5`. Both are strong, permissively licensed (Apache 2.0 and MIT respectively), and well-supported by `sentence-transformers`.

For multilingual applications: `BAAI/bge-m3`. It's the most versatile open-source option and handles hybrid search without needing a separate sparse retrieval model.

If you want a managed API and aren't concerned about lock-in: `text-embedding-3-small` is the easiest path. Upgrade to `text-embedding-3-large` if you're seeing retrieval quality issues.

Here's a minimal setup you can adapt:

```python
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Option 1: Self-hosted with sentence-transformers
class LocalEmbedder:
    def __init__(self, model_name="BAAI/bge-m3", query_prefix=""):
        self.model = SentenceTransformer(model_name)
        # bge-m3 needs no prefix. For the bge-*-en-v1.5 models, pass
        # "Represent this sentence for searching relevant passages: "
        self.query_prefix = query_prefix

    def embed(self, texts, is_query=False):
        # Only queries get the prefix; documents are embedded as-is
        if is_query and self.query_prefix:
            texts = [f"{self.query_prefix}{t}" for t in texts]
        return self.model.encode(texts, normalize_embeddings=True, batch_size=64).tolist()


# Option 2: OpenAI managed API
class OpenAIEmbedder:
    def __init__(self, model="text-embedding-3-small"):
        self.client = OpenAI()
        self.model = model

    def embed(self, texts, is_query=False):
        response = self.client.embeddings.create(input=texts, model=self.model)
        return [item.embedding for item in response.data]
```

Both classes expose the same interface, so you can swap them without changing downstream code.
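
To show what "downstream code" might look like, here is a small search function that depends only on the `embed(texts, is_query=...)` interface. `KeywordEmbedder` is a hypothetical stand-in that counts a few keywords so the example runs without downloading a model or calling an API; it is not a real embedding model.

```python
import numpy as np

class KeywordEmbedder:
    """Hypothetical stand-in with the same embed() interface, for testing only."""
    VOCAB = ["attention", "transformer", "database", "vector"]

    def embed(self, texts, is_query=False):
        # One dimension per keyword: how often it appears in the text
        return [[float(t.lower().count(w)) for w in self.VOCAB] for t in texts]

def search(embedder, query, documents, k=3):
    """Rank documents by cosine similarity to the query using any embedder."""
    doc_vecs = np.asarray(embedder.embed(documents), dtype=np.float32)
    query_vec = np.asarray(embedder.embed([query], is_query=True), dtype=np.float32)[0]
    # Normalize defensively; the guard avoids division by zero for all-zero vectors
    doc_vecs = doc_vecs / np.maximum(np.linalg.norm(doc_vecs, axis=1, keepdims=True), 1e-12)
    query_vec = query_vec / max(np.linalg.norm(query_vec), 1e-12)
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [(documents[i], float(scores[i])) for i in top]

docs = [
    "Attention lets a transformer weigh tokens against each other.",
    "A vector database stores embeddings for similarity search.",
]
results = search(KeywordEmbedder(), "how does attention work", docs, k=1)
print(results[0][0])  # the attention document
```

Passing a `LocalEmbedder` or `OpenAIEmbedder` instance instead of `KeywordEmbedder` should work unchanged, as long as documents and queries go through the same embedder.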

## Evaluating Your Choice

Don't rely solely on MTEB scores. Run your own evaluation on a sample of your actual data:

1. Take 50-100 representative queries from your application.
2. For each query, manually identify the 3-5 most relevant documents in your corpus.
3. Embed everything with the candidate model and compute retrieval metrics: Recall@5 (fraction of relevant docs in the top 5 results) and MRR (mean reciprocal rank).
4. Compare models head-to-head on these numbers.
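
The metrics in step 3 take only a few lines to compute. The sketch below assumes you have, per query, a ranked list of retrieved document IDs and a set of manually labeled relevant IDs; the function names and the sample IDs are placeholders.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(results, labels, k=5):
    """Average Recall@k and MRR over all labeled queries."""
    queries = list(labels)
    recall = sum(recall_at_k(results[q], labels[q], k) for q in queries) / len(queries)
    mrr = sum(reciprocal_rank(results[q], labels[q]) for q in queries) / len(queries)
    return {"recall@k": recall, "mrr": mrr}

# Placeholder labels and retrieval results for two queries
labels = {"q1": {"d1", "d4", "d6"}, "q2": {"d7"}}
results = {
    "q1": ["d4", "d2", "d1", "d9", "d3"],  # 2 of 3 relevant docs found, first hit at rank 1
    "q2": ["d5", "d7", "d6", "d8", "d2"],  # 1 of 1 relevant docs found, first hit at rank 2
}

metrics = evaluate(results, labels, k=5)
print(metrics["mrr"])  # 0.75
```

Run the same `evaluate` call once per candidate model and compare the resulting numbers side by side.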

A model that scores 2 points higher on MTEB but performs worse on your specific domain is the worse choice. Your evaluation data is more predictive than a benchmark built on generic web text.

## Getting the Generation Side Right

Choosing a strong embedding model solves the retrieval half of the problem. The generation half -- turning retrieved chunks into accurate, fast answers -- depends on your LLM inference layer.

Slow inference hurts RAG specifically: every user query passes through embedding, vector search, and LLM generation in sequence. If the LLM step takes 3 seconds, your total latency is already too high for most interactive applications regardless of how fast retrieval is. GeneralCompute's API provides low-latency generation on Llama 4, Qwen3, and other leading open models, and it's OpenAI-compatible -- change `base_url` and `api_key` and your existing code works:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-generalcompute-api-key",
    base_url="https://api.generalcompute.com/v1",
)
```

Solid embeddings plus fast generation gives you the retrieval quality and response speed that make a RAG application actually usable. Start with [generalcompute.com](https://generalcompute.com) to get API access.