How to Build a RAG Pipeline Using Open-Source Models

A complete walkthrough of building a retrieval-augmented generation pipeline with open-source embedding and LLM models: ingestion, chunking, vector storage, retrieval, generation, and evaluation with RAGAS.

Author: General Compute
Published: 2026-06-22
Tags: rag with open source models, retrieval augmented generation, vector database, embeddings, ragas, python, tutorial

Retrieval-augmented generation (RAG) lets you ground an LLM's answers in a specific document corpus without retraining the model. The basic idea: when a user asks a question, you search your document store for relevant chunks, prepend those chunks to the prompt, and let the LLM answer using that retrieved context.

Building this pipeline with open-source models means you control every component: the embedding model, the vector store, and the language model doing the generation. This guide walks through each stage with working Python code, covers the chunking decisions that most affect retrieval quality, and finishes with RAGAS evaluation so you can measure whether the pipeline actually works.

## The Pipeline at a Glance

A RAG pipeline has six stages, followed by an evaluation step that this guide covers as Stage 7:

1. **Ingestion** -- load and parse raw documents
2. **Chunking** -- split documents into retrievable pieces
3. **Embedding** -- convert chunks to dense vectors
4. **Indexing** -- store vectors in a searchable database
5. **Retrieval** -- find chunks relevant to a query
6. **Generation** -- produce an answer using retrieved chunks as context

Each stage has meaningful choices. The sections below cover the important ones.

## Stage 1: Document Ingestion

The most common input formats are PDFs, plain text files, HTML pages, and Markdown. For this walkthrough we'll use a set of Markdown and PDF files. The `unstructured` library handles both without much ceremony:

```python
from unstructured.partition.auto import partition

def load_documents(paths: list[str]) -> list[dict]:
    docs = []
    for path in paths:
        elements = partition(filename=path)
        text = "\n\n".join(str(e) for e in elements if str(e).strip())
        docs.append({"source": path, "text": text})
    return docs
```

For HTML content, `BeautifulSoup` is often simpler and faster than `unstructured`:

```python
import requests
from bs4 import BeautifulSoup

def load_url(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    return {"source": url, "text": text}
```

Clean the text before chunking. Remove repeated whitespace, strip headers and footers that appear on every page, and normalize encoding. Dirty text produces noisy embeddings and confuses the retrieval step.
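
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not part of `unstructured`: the function name `clean_pages`, the per-page input format, and the `min_repeat_ratio` threshold are all assumptions made for the example.

```python
import re
import unicodedata
from collections import Counter


def clean_pages(pages: list[str], min_repeat_ratio: float = 0.8) -> str:
    """Normalize encoding, drop lines repeated on most pages, collapse whitespace."""
    # NFKC folds compatibility characters (ligatures, non-breaking spaces)
    # into their plain equivalents.
    pages = [unicodedata.normalize("NFKC", p) for p in pages]

    # Count on how many pages each non-empty line appears (once per page).
    line_pages = Counter()
    for page in pages:
        line_pages.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    # Assumption: a line present on most pages is a running header or footer.
    threshold = max(2, int(len(pages) * min_repeat_ratio))
    boilerplate = {ln for ln, n in line_pages.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in boilerplate]
        cleaned.append("\n".join(kept))

    text = "\n\n".join(cleaned)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()
```

The repeated-line heuristic only works when you have page boundaries, and it will not catch footers that change per page (page numbers, for instance), so treat it as a starting point.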

## Stage 2: Chunking

Chunking is where most RAG pipelines go wrong. A chunk that is too small loses context; a chunk that is too large dilutes relevance and wastes tokens in the prompt.

**Recursive character splitting** is the most reliable general approach. It tries to split on paragraph breaks first, then sentence boundaries, then individual characters, so chunks respect natural text structure:

```python
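try:  # newer LangChain releases ship splitters in `langchain-text-splitters`
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:  # older releases expose them from the main package
    from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # measured in characters by default, not tokens
    chunk_overlap=64,  # characters shared between neighbouring chunks
    separators=["\n\n", "\n", ". ", " ", ""],
)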

def chunk_documents(docs: list[dict]) -> list[dict]:
    chunks = []
    for doc in docs:
        splits = splitter.split_text(doc["text"])
        for i, text in enumerate(splits):
            chunks.append({
                "text": text,
                "source": doc["source"],
                "chunk_id": f"{doc['source']}::{i}",
            })
    return chunks
```

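Some things worth knowing about chunk size. Note that `RecursiveCharacterTextSplitter` counts characters by default, so `chunk_size=512` above means 512 characters; to size chunks in tokens, pass a token-counting `length_function`.

- **Around 512 tokens** works well for factual Q&A. You get precise retrieval hits.
- **Around 1024 tokens** is better for summarization tasks where broad context matters.
- Overlap (here 64 characters) reduces the chance that a sentence is cut in half at a chunk boundary, which would leave it only partially present in each of the two chunks.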
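
To make the effect of overlap concrete, here is a small sliding-window sketch over a list of units (words, tokens, or characters). It is not how the LangChain splitter is implemented, and `sliding_chunks` is a hypothetical helper name; it only shows how each chunk repeats the tail of the previous one.

```python
def sliding_chunks(units: list, chunk_size: int, overlap: int) -> list[list]:
    """Split a sequence into fixed-size windows that share `overlap` items."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break  # the last window already reaches the end of the input
    return chunks
```

With `chunk_size=4` and `overlap=1`, the last item of every chunk is also the first item of the next, so content near a boundary appears in both.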

For structured content like code or tables, consider splitting by logical boundaries (function definitions, table rows) rather than character count. The recursive splitter handles prose well but fights structure.
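
As one example of splitting on logical boundaries, Python source can be chunked per top-level function or class with the standard `ast` module. This is a sketch under the assumption that the input is valid Python; `split_python_source` is a hypothetical helper, and other languages or tables would need their own parsers.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the `def`/`class` line, so start there.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            text = "\n".join(lines[start - 1:node.end_lineno])
            chunks.append({"name": node.name, "text": text})
    return chunks
```

Each chunk keeps a complete definition together, so a query about one function retrieves its whole body rather than an arbitrary 512-character slice.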

## Stage 3: Embedding with Open-Source Models

Embedding models convert text to dense vectors that capture semantic meaning. The cosine similarity between two vectors approximates how related the corresponding texts are.

Good open-source choices in 2026:

| Model | Dimensions | Notes |
|---|---|---|
| `BAAI/bge-m3` | 1024 | Multilingual, strong MTEB scores |
| `nomic-ai/nomic-embed-text-v1.5` | 768 | Apache-licensed, good for production |
| `intfloat/e5-mistral-7b-instruct` | 4096 | High quality, heavier compute |
| `BAAI/bge-large-en-v1.5` | 1024 | English-focused, efficient |

For most applications, `bge-m3` or `nomic-embed-text-v1.5` hit the right balance of quality and speed. Here's the embedding step using `sentence-transformers`:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

def embed_chunks(chunks: list[dict]) -> list[dict]:
    texts = [c["text"] for c in chunks]
    embeddings = model.encode(
        texts,
        batch_size=64,
        show_progress_bar=True,
        normalize_embeddings=True,  # important for cosine similarity
    )
    for chunk, embedding in zip(chunks, embeddings):
        chunk["embedding"] = embedding.tolist()
    return chunks
```

Normalize embeddings before storing them. When embeddings are L2-normalized, cosine similarity equals dot product, which most vector databases compute more efficiently.
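
The equivalence is easy to check on toy vectors. The helper names below are illustrative only; in the pipeline, `normalize_embeddings=True` does the normalization for you.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the plain dot product gives the same value as cosine.
same = math.isclose(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```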

For production workloads with large corpora, batching embedding calls matters. Encoding texts one at a time can be 10-50x slower than batching on the same hardware, with the exact gap depending on the model, the device, and text length.

## Stage 4: Vector Databases

Vector databases handle approximate nearest-neighbor (ANN) search over millions of embeddings efficiently. Common options:

**Qdrant** -- self-hosted or managed, strong Python client, efficient for medium-to-large corpora.

**Chroma** -- easiest to set up locally, good for development and smaller datasets.

**Weaviate** -- schema-based, good when you need metadata filtering alongside vector search.

**pgvector** -- Postgres extension that adds vector search. Works well if you're already on Postgres and don't want another service.

Here's a Qdrant setup that handles both ingestion and search:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

COLLECTION = "docs"
VECTOR_SIZE = 1024  # matches bge-m3

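def create_collection():
    # recreate_collection drops any existing collection with this name.
    # Recent qdrant-client releases mark it deprecated in favor of
    # collection_exists() followed by create_collection().
    client.recreate_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )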

def index_chunks(chunks: list[dict]):
    points = [
        PointStruct(
            id=i,
            vector=chunk["embedding"],
            payload={"text": chunk["text"], "source": chunk["source"], "chunk_id": chunk["chunk_id"]},
        )
        for i, chunk in enumerate(chunks)
    ]
    client.upsert(collection_name=COLLECTION, points=points)

def search(query_embedding: list[float], top_k: int = 5) -> list[dict]:
    # Recent qdrant-client releases deprecate search() in favor of
    # query_points(); check which one your installed version expects.
    results = client.search(
        collection_name=COLLECTION,
        query_vector=query_embedding,
        limit=top_k,
        with_payload=True,
    )
    return [{"text": r.payload["text"], "source": r.payload["source"], "score": r.score} for r in results]
```

Metadata filtering is worth enabling early. If your corpus contains documents from multiple sources or time periods, restricting the ANN search to a given source or date range can substantially improve precision. Recall only suffers when the filter excludes documents that actually hold the answer.

## Stage 5: Retrieval

Retrieval combines embedding the user query and searching the index:

```python
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    query_embedding = model.encode([query], normalize_embeddings=True)[0].tolist()
    return search(query_embedding, top_k=top_k)
```

A few retrieval strategies beyond plain cosine search:

**Hybrid search** mixes dense vector search with BM25 keyword search. This catches cases where exact term matches matter more than semantic similarity (technical identifiers, product names, version numbers). Qdrant supports this natively; Weaviate has a "hybrid" parameter.
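
For intuition about the keyword half of hybrid search, here is a compact BM25 scorer in plain Python. It is a teaching sketch, not what Qdrant or Weaviate run internally: the tokenizer is a simple regex, `k1` and `b` are set to commonly used values, and the IDF uses the smoothed variant that stays non-negative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

An exact identifier such as an error code scores only against documents that contain that literal token, which is the case dense retrieval tends to miss.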

**Re-ranking** takes the top-N vector search results and passes them through a cross-encoder model that scores each (query, chunk) pair directly. Cross-encoders are more accurate than bi-encoders but too slow to run against the full index, so you use them to re-order a shortlist. `cross-encoder/ms-marco-MiniLM-L-6-v2` is a compact re-ranker that works well:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 3) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```

**Reciprocal rank fusion** combines results from multiple retrieval strategies (dense, sparse, different embedding models) by summing reciprocal ranks. Its only parameter is a smoothing constant added to each rank (60 is a common default), so it is a simple way to blend retrieval signals without training a separate combiner model.
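
A minimal sketch of reciprocal rank fusion, assuming each retriever returns an ordered list of chunk IDs. The function name is illustrative, and `k=60` is the commonly used smoothing constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists into one, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items missing from a
            # list simply get no contribution from it.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


dense = ["a", "b", "c"]   # IDs ranked by vector similarity
sparse = ["b", "d", "a"]  # IDs ranked by keyword score
fused = reciprocal_rank_fusion([dense, sparse])
```

Only ranks are used, so the raw scores of the two retrievers never need to be on the same scale.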

## Stage 6: Generation

Once you have retrieved chunks, format them into a prompt and call the LLM. GeneralCompute's API is OpenAI-compatible, so you can use the `openai` SDK pointed at the GeneralCompute endpoint:

```python
from openai import OpenAI

llm_client = OpenAI(
    api_key="your_generalcompute_api_key",
    base_url="https://api.generalcompute.com/v1",
)

def build_context(chunks: list[dict]) -> str:
    sections = []
    for i, chunk in enumerate(chunks, 1):
        sections.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")
    return "\n\n---\n\n".join(sections)

def generate_answer(query: str, chunks: list[dict], model: str = "llama-4-maverick") -> str:
    context = build_context(chunks)
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer the user's question using only the provided context. "
                "If the context does not contain enough information to answer confidently, say so. "
                "Cite the source numbers when you reference specific information."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}",
        },
    ]
    response = llm_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1,  # low temperature for factual retrieval tasks
        max_tokens=1024,
    )
    return response.choices[0].message.content
```

The system prompt matters here. Instructing the model to rely only on the provided context and to acknowledge when it can't answer reduces hallucination. Setting temperature low (0.0-0.2) also helps for factual Q&A.

Putting retrieval and generation together:

```python
def rag_query(query: str) -> dict:
    candidates = retrieve(query, top_k=10)
    top_chunks = rerank(query, candidates, top_k=3)
    answer = generate_answer(query, top_chunks)
    return {
        "query": query,
        "answer": answer,
        "sources": [c["source"] for c in top_chunks],
    }
```

## Stage 7: Evaluation with RAGAS

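Shipping a RAG pipeline without an evaluation loop means you're debugging in production. RAGAS is an open-source framework that evaluates RAG pipelines with LLM-as-judge metrics. Faithfulness and answer relevance need no hand-labeled answers; context recall needs reference answers, and depending on the RAGAS version, so does context precision.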

Install it:

```bash
pip install ragas datasets
```

RAGAS computes four core metrics:

- **Faithfulness** -- does the answer stick to what the retrieved context actually says? Measures hallucination.
- **Answer relevance** -- how well does the answer address the question asked?
- **Context precision** -- are the retrieved chunks actually relevant to the question? Measures retrieval quality.
- **Context recall** -- given reference answers, how much of the needed information appears in the retrieved context?

You need a dataset of (question, answer, contexts, ground_truth) tuples. For an initial evaluation you can manually create 20-50 test cases or generate them from your documents using an LLM:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

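# Build an evaluation dataset by running your pipeline over test questions.
# Column names follow the schema this walkthrough was written against;
# check the RAGAS docs for the names your installed version expects.
test_questions = [
    "What are the key configuration options for the ingestion pipeline?",
    "How does the system handle duplicate documents?",
    # ... add more
]

eval_data = []
for question in test_questions:
    # Retrieve and rerank once, then reuse the same chunks for generation and
    # scoring so the metrics judge the context the model actually saw.
    candidates = retrieve(question, top_k=10)
    top_chunks = rerank(question, candidates, top_k=3)
    eval_data.append({
        "question": question,
        "answer": generate_answer(question, top_chunks),
        "contexts": [c["text"] for c in top_chunks],
        # Reference answer. context_recall needs it, and so does
        # context_precision in RAGAS versions where it is reference-based.
        "ground_truth": "",
    })

dataset = Dataset.from_list(eval_data)
# RAGAS calls a judge LLM (OpenAI credentials by default); evaluate() also
# accepts llm= and embeddings= if you want to use other models.
# Add context_precision and context_recall once ground_truth is filled in.
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(scores)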
```

Typical baselines to aim for: faithfulness above 0.85, answer relevancy above 0.80. If faithfulness is low, the LLM is adding information not in the context -- tighten the system prompt and lower temperature. If context precision is low, retrieval is surfacing irrelevant chunks -- tune chunk size, try hybrid search, or add a re-ranker.

Running RAGAS after every significant change (new chunking strategy, different embedding model, updated prompts) gives you a concrete signal for whether the change helped.
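
One lightweight way to turn those scores into a signal is to diff each run against a saved baseline. The helper below is a hypothetical sketch: it assumes you have already extracted the metric averages into plain dicts, and the `tolerance` of 0.02 is an arbitrary allowance for judge-model noise, not a RAGAS recommendation.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, dict]:
    """Label each shared metric as improved, regressed, or unchanged."""
    report = {}
    for metric, base in baseline.items():
        if metric not in candidate:
            continue  # only compare metrics present in both runs
        delta = candidate[metric] - base
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        report[metric] = {"delta": round(delta, 4), "verdict": verdict}
    return report
```

A change that improves faithfulness but regresses context precision is then visible at a glance, rather than hidden inside one averaged number.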

## Putting It All Together

Here's the full pipeline structure as a single module:

```python
# rag_pipeline.py

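from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from unstructured.partition.auto import partition
from openai import OpenAI

try:  # newer LangChain releases ship splitters in `langchain-text-splitters`
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:  # older releases expose them from the main package
    from langchain.text_splitter import RecursiveCharacterTextSplitter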

COLLECTION = "docs"
EMBED_MODEL = "BAAI/bge-m3"
VECTOR_SIZE = 1024
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"

embedder = SentenceTransformer(EMBED_MODEL)
reranker = CrossEncoder(RERANK_MODEL)
qdrant = QdrantClient(host="localhost", port=6333)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
llm = OpenAI(api_key="...", base_url="https://api.generalcompute.com/v1")


def ingest(paths: list[str]):
    qdrant.recreate_collection(
        COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    chunks = []
    for path in paths:
        elements = partition(filename=path)
        text = "\n\n".join(str(e) for e in elements if str(e).strip())
        for i, chunk_text in enumerate(splitter.split_text(text)):
            chunks.append({"text": chunk_text, "source": path, "id": f"{path}::{i}"})

    texts = [c["text"] for c in chunks]
    embeddings = embedder.encode(texts, batch_size=64, normalize_embeddings=True)
    points = [
        PointStruct(id=i, vector=emb.tolist(), payload={"text": c["text"], "source": c["source"]})
        for i, (c, emb) in enumerate(zip(chunks, embeddings))
    ]
    qdrant.upsert(COLLECTION, points)
    print(f"Indexed {len(chunks)} chunks from {len(paths)} documents.")


def query(question: str) -> dict:
    q_emb = embedder.encode([question], normalize_embeddings=True)[0].tolist()
    hits = qdrant.search(COLLECTION, query_vector=q_emb, limit=10, with_payload=True)
    candidates = [{"text": h.payload["text"], "source": h.payload["source"]} for h in hits]

    pairs = [(question, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
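    # Sort on the score only; without a key, tied scores would fall back to
    # comparing the candidate dicts and raise a TypeError.
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    top3 = [c for _, c in ranked[:3]]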

    context = "\n\n---\n\n".join(f"[{c['source']}]\n{c['text']}" for c in top3)
    resp = llm.chat.completions.create(
        model="llama-4-maverick",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,
    )
    return {"answer": resp.choices[0].message.content, "sources": [c["source"] for c in top3]}
```

Usage:

```bash
pip install sentence-transformers qdrant-client langchain langchain-text-splitters "unstructured[pdf,md]" openai ragas datasets
docker run -p 6333:6333 qdrant/qdrant
```

```python
from rag_pipeline import ingest, query

ingest(["docs/product_manual.pdf", "docs/faq.md"])
result = query("How do I reset my API key?")
print(result["answer"])
```

## Common Failure Modes

**Retrieval misses the relevant chunk.** Check context recall in RAGAS (it needs reference answers); low context precision points to the related problem of irrelevant chunks crowding out useful ones. Common causes: chunk size too large (dilutes the signal), embedding model doesn't understand the domain, or the query and document use different terminology. Try hybrid search or query expansion (generate multiple phrasings of the question and search with all of them).

**Answer adds information not in the retrieved context.** Faithfulness score below 0.8 is the signal. Add explicit instructions to the system prompt ("Do not add any information that is not directly stated in the context above") and lower temperature.

**Slow retrieval at scale.** Qdrant and similar databases use HNSW indexing, which keeps search fast as the corpus grows. If you're seeing slow searches, check that you're using the indexed collection (not a raw scan) and that your Qdrant instance has enough RAM to keep the index in memory.

**High latency end-to-end.** The bottleneck is usually LLM generation, not retrieval or embedding. Faster inference directly shortens the user-visible response time. GeneralCompute's inference infrastructure is optimized for low TTFT and high token throughput, which helps when you're generating detailed answers from dense context.

## Next Steps

A working pipeline is the starting point. From here:

- Run RAGAS on a representative evaluation set and establish baseline scores before making further changes.
- Add metadata filtering to your vector queries so users can scope searches to specific document sets or date ranges.
- Implement streaming generation so the UI can show tokens as they arrive rather than waiting for the full response.
- Look at query routing if you have multiple document corpora: classify the query first and send it to the appropriate index.
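
As a starting point for the query routing item, a keyword-overlap router is often enough before reaching for an LLM classifier. Everything in this sketch is hypothetical: the collection names, the keyword sets, and the `route_query` helper are placeholders for your own corpora.

```python
import re

# Hypothetical collections and the keywords that suggest each one.
ROUTES = {
    "billing_docs": {"invoice", "refund", "billing", "payment"},
    "api_docs": {"api", "endpoint", "token", "key", "sdk"},
}


def route_query(query: str, routes: dict[str, set[str]] = ROUTES,
                default: str = "general_docs") -> str:
    """Pick the collection whose keywords overlap most with the query."""
    words = set(re.findall(r"\w+", query.lower()))
    best, best_hits = default, 0
    for collection, keywords in routes.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = collection, hits
    return best  # falls back to `default` when nothing matches
```

The returned name would then be passed as the collection to search. Replacing the keyword match with an LLM or embedding-based classifier keeps the same interface.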

The GeneralCompute API is OpenAI-compatible, so any code that works with `openai` works with GeneralCompute by changing `base_url` and `api_key`. Sign up at [generalcompute.com](https://generalcompute.com) to get API access and start building.