How to Build a RAG Pipeline Using Open-Source Models
Retrieval-augmented generation (RAG) lets you ground an LLM's answers in a specific document corpus without retraining the model. The basic idea: when a user asks a question, you search your document store for relevant chunks, prepend those chunks to the prompt, and let the LLM answer using that retrieved context.
Building this pipeline with open-source models means you control every component: the embedding model, the vector store, and the language model doing the generation. This guide walks through each stage with working Python code, covers the chunking decisions that most affect retrieval quality, and finishes with RAGAS evaluation so you can measure whether the pipeline actually works.
The Pipeline at a Glance
A RAG pipeline has six stages, with evaluation as a seventh step once the pipeline runs end to end:
- Ingestion -- load and parse raw documents
- Chunking -- split documents into retrievable pieces
- Embedding -- convert chunks to dense vectors
- Indexing -- store vectors in a searchable database
- Retrieval -- find chunks relevant to a query
- Generation -- produce an answer using retrieved chunks as context
Each stage has meaningful choices. The sections below cover the important ones.
Stage 1: Document Ingestion
The most common input formats are PDFs, plain text files, HTML pages, and Markdown. For this walkthrough we'll use a set of Markdown and PDF files. The unstructured library handles both without much ceremony:
```python
from unstructured.partition.auto import partition


def load_documents(paths: list[str]) -> list[dict]:
    docs = []
    for path in paths:
        # partition() detects the file type and returns a list of elements
        elements = partition(filename=path)
        text = "\n\n".join(str(e) for e in elements if str(e).strip())
        docs.append({"source": path, "text": text})
    return docs
```
For HTML content, BeautifulSoup is often simpler and faster than unstructured:
```python
import requests
from bs4 import BeautifulSoup


def load_url(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of indexing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop elements that carry no document content
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    return {"source": url, "text": text}
```
Clean the text before chunking. Remove repeated whitespace, strip headers and footers that appear on every page, and normalize encoding. Dirty text produces noisy embeddings and confuses the retrieval step.
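A minimal cleaning pass might look like the sketch below. It assumes each document arrives as a list of page strings (the `pages` argument and the `clean_pages` name are illustrative, not something unstructured returns directly) and treats any line that repeats on most pages as a header or footer. The 0.6 ratio is a guess to tune against your own documents.

```python
import re
import unicodedata
from collections import Counter


def clean_pages(pages: list[str], repeat_ratio: float = 0.6) -> str:
    """Normalize encoding, drop lines repeated across pages, collapse whitespace."""
    # NFKC folds compatibility characters (ligatures, non-breaking spaces) into plain forms
    pages = [unicodedata.normalize("NFKC", p) for p in pages]

    # Count on how many pages each distinct non-empty line appears
    line_pages = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            line_pages[line] += 1

    # Assumption: a line present on most pages is a running header or footer
    min_pages = max(2, int(len(pages) * repeat_ratio))
    boilerplate = {line for line, n in line_pages.items() if n >= min_pages}

    cleaned = []
    for page in pages:
        kept = [ln.strip() for ln in page.splitlines() if ln.strip() not in boilerplate]
        cleaned.append("\n".join(kept))

    text = "\n\n".join(cleaned)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()
```

Because the threshold never drops below two pages, single-page documents pass through with only whitespace and encoding cleanup.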
Stage 2: Chunking
Chunking is where most RAG pipelines go wrong. A chunk that is too small loses context; a chunk that is too large dilutes relevance and wastes tokens in the prompt.
Recursive character splitting is the most reliable general approach. It tries to split on paragraph breaks first, then line breaks, sentence boundaries, and spaces, and only falls back to individual characters as a last resort, so chunks respect natural text structure:
```python
# In recent LangChain releases this class lives in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # counted in characters by default (length_function=len)
    chunk_overlap=64,  # also in characters
    separators=["\n\n", "\n", ". ", " ", ""],
)


def chunk_documents(docs: list[dict]) -> list[dict]:
    chunks = []
    for doc in docs:
        splits = splitter.split_text(doc["text"])
        for i, text in enumerate(splits):
            chunks.append({
                "text": text,
                "source": doc["source"],
                "chunk_id": f"{doc['source']}::{i}",
            })
    return chunks
```
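Some things worth knowing about chunk size:
- RecursiveCharacterTextSplitter counts characters by default, so chunk_size=512 in the code above means 512 characters, not tokens. To size chunks in tokens, build the splitter with RecursiveCharacterTextSplitter.from_huggingface_tokenizer or pass a token-counting length_function.
- Around 512 tokens works well for factual Q&A. You get precise retrieval hits.
- Around 1024 tokens is better for summarization tasks where broad context matters.
- Overlap (here 64) repeats the tail of one chunk at the head of the next, so a short sentence that straddles a boundary still appears whole in at least one chunk instead of being cut in half.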
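To make the overlap concrete, here is a simplified sliding-window sketch. It uses whitespace-separated words as a stand-in for tokens (an assumption for readability; a real pipeline would count with the embedding model's tokenizer), and `window_chunks` is an illustrative helper, not part of LangChain.

```python
def window_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into word windows where consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in window_chunks(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each window starts with the last word of the previous one, which is the same effect chunk_overlap has in the recursive splitter.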
For structured content like code or tables, consider splitting by logical boundaries (function definitions, table rows) rather than character count. The recursive splitter handles prose well but fights structure.
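As one example of splitting on logical boundaries, the sketch below chunks a Python source file by top-level function and class definitions using the standard-library ast module. `split_python_source` is a hypothetical helper; other languages would need their own parser, and this version deliberately ignores module-level statements such as imports.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in a Python file."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                # get_source_segment returns the source text spanned by this node
                "text": ast.get_source_segment(source, node),
            })
    return chunks


example = '''
import os

def load(path):
    return open(path).read()

class Index:
    def add(self, item):
        pass
'''

for chunk in split_python_source(example):
    print(chunk["name"], len(chunk["text"].splitlines()))
```

Keeping the definition name alongside the text gives you a natural metadata field to store in the vector database payload.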
Stage 3: Embedding with Open-Source Models
Embedding models convert text to dense vectors that capture semantic meaning. The cosine similarity between two vectors approximates how related the corresponding texts are.
Good open-source choices in 2026:
| Model | Dimensions | Notes |
|---|---|---|
| BAAI/bge-m3 | 1024 | Multilingual, strong MTEB scores |
| nomic-ai/nomic-embed-text-v1.5 | 768 | Apache-licensed, good for production |
| intfloat/e5-mistral-7b-instruct | 4096 | High quality, heavier compute |
| BAAI/bge-large-en-v1.5 | 1024 | English-focused, efficient |
For most applications, bge-m3 or nomic-embed-text-v1.5 hit the right balance of quality and speed. Here's the embedding step using sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")


def embed_chunks(chunks: list[dict]) -> list[dict]:
    texts = [c["text"] for c in chunks]
    embeddings = model.encode(
        texts,
        batch_size=64,
        show_progress_bar=True,
        normalize_embeddings=True,  # important for cosine similarity
    )
    for chunk, embedding in zip(chunks, embeddings):
        chunk["embedding"] = embedding.tolist()
    return chunks
```
Normalize embeddings before storing them. When embeddings are L2-normalized, cosine similarity equals dot product, which most vector databases compute more efficiently.
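The equivalence is easy to check by hand. The sketch below uses plain Python lists and two made-up three-dimensional vectors rather than real embeddings, purely to show that the dot product of L2-normalized vectors matches the cosine similarity of the originals.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(cos_ab, dot_normalized))
```

Normalizing once at indexing time moves the two square roots and the division out of every query-time comparison.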
For production workloads with large corpora, batching embedding calls matters. Encoding texts one at a time pays fixed per-call overhead and leaves most of the accelerator idle, so it can be 10-50x slower than batching on the same hardware.
Stage 4: Vector Databases
Vector databases handle approximate nearest-neighbor (ANN) search over millions of embeddings efficiently. Common options:
Qdrant -- self-hosted or managed, strong Python client, efficient for medium-to-large corpora.
Chroma -- easiest to set up locally, good for development and smaller datasets.
Weaviate -- schema-based, good when you need metadata filtering alongside vector search.
pgvector -- Postgres extension that adds vector search. Works well if you're already on Postgres and don't want another service.
Here's a Qdrant setup that handles both ingestion and search:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

COLLECTION = "docs"
VECTOR_SIZE = 1024  # matches bge-m3


def create_collection():
    # recreate_collection drops any existing collection with this name.
    # Newer qdrant-client releases deprecate it in favor of
    # collection_exists() followed by create_collection().
    client.recreate_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )


def index_chunks(chunks: list[dict]):
    points = [
        PointStruct(
            id=i,  # point IDs must be unsigned integers or UUIDs
            vector=chunk["embedding"],
            payload={"text": chunk["text"], "source": chunk["source"], "chunk_id": chunk["chunk_id"]},
        )
        for i, chunk in enumerate(chunks)
    ]
    client.upsert(collection_name=COLLECTION, points=points)


def search(query_embedding: list[float], top_k: int = 5) -> list[dict]:
    # Newer qdrant-client releases deprecate search() in favor of query_points()
    results = client.search(
        collection_name=COLLECTION,
        query_vector=query_embedding,
        limit=top_k,
        with_payload=True,
    )
    return [{"text": r.payload["text"], "source": r.payload["source"], "score": r.score} for r in results]
```
Metadata filtering is worth enabling early. If your corpus contains documents from multiple sources or time periods, filtering by source or date alongside the ANN search can improve precision substantially, and it costs no recall as long as the filter does not exclude documents the user actually needs.
Stage 5: Retrieval
Retrieval combines embedding the user query and searching the index:
```python
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    # Embed the query with the same model and normalization used for the chunks
    query_embedding = model.encode([query], normalize_embeddings=True)[0].tolist()
    return search(query_embedding, top_k=top_k)
```
A few retrieval strategies beyond plain cosine search:
Hybrid search mixes dense vector search with BM25 keyword search. This catches cases where exact term matches matter more than semantic similarity (technical identifiers, product names, version numbers). Qdrant supports this natively; Weaviate has a "hybrid" parameter.
Re-ranking takes the top-N vector search results and passes them through a cross-encoder model that scores each (query, chunk) pair directly. Cross-encoders are more accurate than bi-encoders but too slow to run against the full index, so you use them to re-order a shortlist. cross-encoder/ms-marco-MiniLM-L-6-v2 is a compact re-ranker that works well:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, candidates: list[dict], top_k: int = 3) -> list[dict]:
    # The cross-encoder scores each (query, chunk) pair jointly
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```
Reciprocal rank fusion combines results from multiple retrieval strategies (dense, sparse, different embedding models) by summing reciprocal ranks: each chunk earns 1/(k + rank) from every result list it appears in, with the constant k conventionally left at 60. It's a simple way to blend retrieval signals without tuning weights or training a separate combiner model.
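A sketch of the fusion step, assuming each retriever returns an ordered list of chunk IDs (best first). The usual formulation includes a smoothing constant k, commonly left at 60; the function name and the sample IDs below are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs; each list is ordered best-first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["doc.md::3", "doc.md::1", "faq.md::0"]
sparse = ["faq.md::0", "doc.md::3", "doc.md::7"]

for chunk_id, score in reciprocal_rank_fusion([dense, sparse]):
    print(f"{chunk_id}  {score:.4f}")
```

Chunks that appear in both lists rise to the top even when neither retriever ranked them first, which is the behavior you want from a hybrid setup.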
Stage 6: Generation
Once you have retrieved chunks, format them into a prompt and call the LLM. GeneralCompute's API is OpenAI-compatible, so you can use the openai SDK pointed at the GeneralCompute endpoint:
```python
from openai import OpenAI

llm_client = OpenAI(
    api_key="your_generalcompute_api_key",
    base_url="https://api.generalcompute.com/v1",
)


def build_context(chunks: list[dict]) -> str:
    sections = []
    for i, chunk in enumerate(chunks, 1):
        sections.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")
    return "\n\n---\n\n".join(sections)


def generate_answer(query: str, chunks: list[dict], model: str = "llama-4-maverick") -> str:
    context = build_context(chunks)
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer the user's question using only the provided context. "
                "If the context does not contain enough information to answer confidently, say so. "
                "Cite the source numbers when you reference specific information."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}",
        },
    ]
    response = llm_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1,  # low temperature for factual retrieval tasks
        max_tokens=1024,
    )
    return response.choices[0].message.content
```
The system prompt matters here. Instructing the model to rely only on the provided context and to acknowledge when it can't answer reduces hallucination. Setting temperature low (0.0-0.2) also helps for factual Q&A.
Putting retrieval and generation together:
```python
def rag_query(query: str) -> dict:
    candidates = retrieve(query, top_k=10)            # wide shortlist from the vector index
    top_chunks = rerank(query, candidates, top_k=3)   # cross-encoder picks the best three
    answer = generate_answer(query, top_chunks)
    return {
        "query": query,
        "answer": answer,
        "sources": [c["source"] for c in top_chunks],
    }
```
Stage 7: Evaluation with RAGAS
Shipping a RAG pipeline without an evaluation loop means you're debugging in production. RAGAS is an open-source framework that evaluates RAG pipelines using LLM-as-judge metrics, most of which work without hand-labeled answers.
Install it:
pip install ragas datasets
RAGAS computes four core metrics:
- Faithfulness -- does the answer stick to what the retrieved context actually says? Measures hallucination.
- Answer relevance -- how well does the answer address the question asked?
- Context precision -- are the retrieved chunks actually relevant to the question? Measures retrieval quality.
- Context recall -- given reference answers, how much of the needed information appears in the retrieved context?
You need a dataset of (question, answer, contexts, ground_truth) tuples. For an initial evaluation you can manually create 20-50 test cases or generate them from your documents using an LLM:
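```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Build an evaluation dataset by running your pipeline over test questions.
# Column names follow the RAGAS 0.1-style schema; newer releases rename them.
test_questions = [
    "What are the key configuration options for the ingestion pipeline?",
    "How does the system handle duplicate documents?",
    # ... add more
]

eval_data = []
for question in test_questions:
    # Score the same chunks the answer was generated from
    candidates = retrieve(question, top_k=10)
    top_chunks = rerank(question, candidates, top_k=3)
    eval_data.append({
        "question": question,
        "answer": generate_answer(question, top_chunks),
        "contexts": [c["text"] for c in top_chunks],
        # Reference answer: required for context_recall, and some RAGAS
        # versions also use it for context_precision
        "ground_truth": "",
    })

dataset = Dataset.from_list(eval_data)
# RAGAS calls an OpenAI judge model by default; pass llm= and embeddings= to use another.
# Add context_recall to the list once ground_truth is filled in.
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
```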
Typical baselines to aim for: faithfulness above 0.85, answer relevancy above 0.80. If faithfulness is low, the LLM is adding information not in the context -- tighten the system prompt and lower temperature. If context precision is low, retrieval is surfacing irrelevant chunks -- tune chunk size, try hybrid search, or add a re-ranker.
Running RAGAS after every significant change (new chunking strategy, different embedding model, updated prompts) gives you a concrete signal for whether the change helped.
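One lightweight way to turn those scores into a signal is a pair of helpers like the ones below. This sketch assumes you have copied the aggregate metrics from each run into a plain dict; the targets for faithfulness and answer relevancy come from the text above, while the 0.02 tolerance and the sample numbers are illustrative assumptions, not measurements.

```python
TARGETS = {
    "faithfulness": 0.85,      # baseline suggested above
    "answer_relevancy": 0.80,  # baseline suggested above
}


def below_target(scores: dict[str, float], targets: dict[str, float] = TARGETS) -> list[str]:
    """List metrics that fall short of their target."""
    return [m for m, target in targets.items() if m in scores and scores[m] < target]


def compare_runs(before: dict[str, float], after: dict[str, float], tolerance: float = 0.02) -> dict[str, str]:
    """Label each shared metric as improved, regressed, or unchanged."""
    verdicts = {}
    for metric in sorted(before.keys() & after.keys()):
        delta = after[metric] - before[metric]
        if delta > tolerance:
            verdicts[metric] = "improved"
        elif delta < -tolerance:
            verdicts[metric] = "regressed"
        else:
            verdicts[metric] = "unchanged"  # within noise for an LLM-judged metric
    return verdicts


# Made-up scores for two pipeline versions
baseline = {"faithfulness": 0.82, "answer_relevancy": 0.84, "context_precision": 0.61}
candidate = {"faithfulness": 0.88, "answer_relevancy": 0.83, "context_precision": 0.55}

print(below_target(candidate))
print(compare_runs(baseline, candidate))
```

The tolerance matters because LLM-judged metrics vary from run to run; treating every small wobble as a regression makes the check noisy.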
Putting It All Together
Here's the full pipeline structure as a single module:
```python
# rag_pipeline.py
from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.auto import partition
from openai import OpenAI

COLLECTION = "docs"
EMBED_MODEL = "BAAI/bge-m3"
VECTOR_SIZE = 1024
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"

embedder = SentenceTransformer(EMBED_MODEL)
reranker = CrossEncoder(RERANK_MODEL)
qdrant = QdrantClient(host="localhost", port=6333)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
llm = OpenAI(api_key="...", base_url="https://api.generalcompute.com/v1")


def ingest(paths: list[str]):
    # Drops and recreates the collection, so each call re-indexes from scratch
    qdrant.recreate_collection(
        COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    chunks = []
    for path in paths:
        elements = partition(filename=path)
        text = "\n\n".join(str(e) for e in elements if str(e).strip())
        for i, chunk_text in enumerate(splitter.split_text(text)):
            chunks.append({"text": chunk_text, "source": path, "id": f"{path}::{i}"})

    texts = [c["text"] for c in chunks]
    embeddings = embedder.encode(texts, batch_size=64, normalize_embeddings=True)
    points = [
        PointStruct(id=i, vector=emb.tolist(), payload={"text": c["text"], "source": c["source"]})
        for i, (c, emb) in enumerate(zip(chunks, embeddings))
    ]
    qdrant.upsert(COLLECTION, points)
    print(f"Indexed {len(chunks)} chunks from {len(paths)} documents.")


def query(question: str) -> dict:
    q_emb = embedder.encode([question], normalize_embeddings=True)[0].tolist()
    hits = qdrant.search(COLLECTION, query_vector=q_emb, limit=10, with_payload=True)
    candidates = [{"text": h.payload["text"], "source": h.payload["source"]} for h in hits]

    pairs = [(question, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score only; comparing the dicts on a tied score would raise TypeError
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    top3 = [c for _, c in ranked[:3]]

    context = "\n\n---\n\n".join(f"[{c['source']}]\n{c['text']}" for c in top3)
    resp = llm.chat.completions.create(
        model="llama-4-maverick",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,
    )
    return {"answer": resp.choices[0].message.content, "sources": [c["source"] for c in top3]}
```
Usage:
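```bash
# The pdf and md extras pull in the parsers unstructured needs for the files below
pip install sentence-transformers qdrant-client langchain "unstructured[pdf,md]" openai ragas
docker run -p 6333:6333 qdrant/qdrant
```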
```python
from rag_pipeline import ingest, query

ingest(["docs/product_manual.pdf", "docs/faq.md"])
result = query("How do I reset my API key?")
print(result["answer"])
```
Common Failure Modes
Retrieval misses the relevant chunk. Check context recall and context precision in RAGAS. Common causes: chunk size too large (dilutes the signal), an embedding model that doesn't understand the domain, or a query and document that use different terminology. Try hybrid search or query expansion (generate multiple phrasings of the question and search with all of them).
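Query expansion can be sketched as a thin wrapper around whatever search function you already have. The version below assumes the phrasings have been generated elsewhere (for example by the LLM) and that each hit is a dict with source, text, and score keys, as retrieve returns above. `multi_query_search` and `fake_search` are illustrative names, and the canned hits exist only so the example runs offline.

```python
from typing import Callable


def multi_query_search(
    phrasings: list[str],
    search_fn: Callable[[str, int], list[dict]],
    top_k: int = 5,
) -> list[dict]:
    """Search with every phrasing and keep each chunk's best score."""
    best = {}
    for phrasing in phrasings:
        for hit in search_fn(phrasing, top_k):
            key = (hit["source"], hit["text"])  # identifies a chunk across searches
            if key not in best or hit["score"] > best[key]["score"]:
                best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]


def fake_search(query: str, top_k: int) -> list[dict]:
    # Stand-in for retrieve(); returns canned hits instead of querying an index
    canned = {
        "reset api key": [
            {"source": "faq.md", "text": "Rotate keys in Settings.", "score": 0.62},
        ],
        "rotate credentials": [
            {"source": "faq.md", "text": "Rotate keys in Settings.", "score": 0.81},
            {"source": "manual.pdf", "text": "Credentials expire after 90 days.", "score": 0.55},
        ],
    }
    return canned.get(query, [])[:top_k]


hits = multi_query_search(["reset api key", "rotate credentials"], fake_search, top_k=3)
for hit in hits:
    print(hit["source"], hit["score"])
```

In the real pipeline you would pass retrieve as search_fn and hand the merged hits to the re-ranker.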
Answer adds information not in the retrieved context. A faithfulness score below 0.8 is the signal. Add explicit instructions to the system prompt ("Do not add any information that is not directly stated in the context above") and lower the temperature.
Slow retrieval at scale. Qdrant and similar databases use HNSW indexing, which keeps search fast as the corpus grows. If you're seeing slow searches, check that you're using the indexed collection (not a raw scan) and that your Qdrant instance has enough RAM to keep the index in memory.
High latency end-to-end. The bottleneck is usually LLM generation, not retrieval or embedding. Faster inference directly shortens the user-visible response time. GeneralCompute's inference infrastructure is optimized for low TTFT and high token throughput, which helps when you're generating detailed answers from dense context.
Next Steps
A working pipeline is the starting point. From here:
- Run RAGAS on a representative evaluation set and establish baseline scores before making further changes.
- Add metadata filtering to your vector queries so users can scope searches to specific document sets or date ranges.
- Implement streaming generation so the UI can show tokens as they arrive rather than waiting for the full response.
- Look at query routing if you have multiple document corpora: classify the query first and send it to the appropriate index.
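For the streaming item, a minimal sketch with the openai SDK looks like the generator below. It assumes the endpoint implements the OpenAI streaming protocol (stream=True on chat completions) and that `client` is an OpenAI client configured as in the generation stage; `stream_answer` is an illustrative name.

```python
from typing import Iterator


def stream_answer(client, model: str, messages: list[dict]) -> Iterator[str]:
    """Yield text fragments as the server sends them."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1,
        stream=True,  # ask for incremental chunks instead of one final response
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (role-only or final chunks)
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


# Usage, given the llm_client and messages from the generation stage:
# for piece in stream_answer(llm_client, "llama-4-maverick", messages):
#     print(piece, end="", flush=True)
```

Because the function is a generator, a web handler can forward each fragment to the browser as it arrives instead of buffering the full answer.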
The GeneralCompute API is OpenAI-compatible, so any code that works with openai works with GeneralCompute by changing base_url and api_key. Sign up at generalcompute.com to get API access and start building.