Fine-Tuning vs RAG: Which Approach Is Right for Your Production App?

When you need an LLM to work reliably on your specific domain, two approaches come up repeatedly: fine-tuning and retrieval-augmented generation (RAG). Both can improve model performance on your use case. They do it in fundamentally different ways, and picking the wrong one costs you time, money, and model quality.

This post walks through when each approach is the right tool, how to compare their costs, and when a hybrid makes sense. The target audience is engineers who already understand the basics and need a concrete decision framework.

What Fine-Tuning Actually Does

Fine-tuning continues training a pretrained model on your data. You're adjusting the weights of the network so that the model's behavior more closely matches what you want. The result is a model that has internalized your domain's patterns: its response style, its vocabulary, its output format.

What this is good for:

Teaching the model a new style or format (structured outputs, a specific JSON schema, a branded voice)
Instilling domain-specific reasoning patterns (how to interpret a lab result, how to handle edge cases in your legal ontology)
Situations where you want the model to behave a certain way without being told in every prompt
Reducing token overhead when system prompts would otherwise be long and repetitive

What it is not good for:

Injecting specific facts that change frequently (product prices, recent events, updated documentation)
Giving the model access to content that wasn't available at training time
Situations where you need the model to cite or reference specific source documents

Fine-tuning struggles with facts because of how neural networks store information. Facts are distributed across many weights rather than stored in an index, so you can't reliably update a single fact in a fine-tuned model without risking interference with other things the model has learned. If your product catalog changes monthly, fine-tuning on it means re-running fine-tuning every month.

What RAG Actually Does

RAG retrieves relevant documents at inference time and injects them into the context window before generation. The model itself doesn't change. You're giving it better information to work with on each request.

The basic pipeline: user query comes in, you embed it, you search a vector store (or lexical index, or hybrid), you pull the top-K chunks, you prepend them to the prompt, and the model generates a response that can draw on that retrieved content.
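The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy word-count stand-in for a real embedding model, the in-memory list stands in for a vector store, and all names (`embed`, `retrieve`, `build_prompt`, `CHUNKS`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank every chunk against the query and keep the top K."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Prepend the retrieved chunks, numbered so the model can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge-base chunks.
CHUNKS = [
    "Refunds are processed within 5 business days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
]

question = "What is the API rate limit?"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)  # this string is what you would send to the model
```

Swapping `embed` for a real embedding model and the list scan for a vector index changes the quality and speed of retrieval, but not the shape of the pipeline.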

What this is good for:

Grounding responses in your actual knowledge base (documentation, wikis, support tickets)
Factual accuracy over content that changes over time
Citing sources -- the model can reference the chunks you retrieved
Scenarios where the knowledge domain is large and would require massive fine-tuning datasets to cover

What RAG is not good for:

Teaching the model how to reason rather than what to know
Situations where every query needs the same formatting behavior -- a system prompt or fine-tuning already covers that, and RAG only adds latency and complexity you may not need
Ultra-low-latency applications where the retrieval round-trip is a meaningful fraction of your budget

The main failure mode in RAG is retrieval quality. If the retrieval step returns irrelevant chunks, the model is likely to hallucinate or give low-quality answers even if the right information exists somewhere in your corpus. Garbage in, garbage out applies here as much as anywhere else.

Decision Framework

Before picking an approach, answer four questions:

1. Does the knowledge change frequently?

If yes, RAG is usually the better fit. You update your vector store, not your model weights. If the knowledge is stable (how to format a medical procedure code, what your API error handling should look like), fine-tuning handles it more reliably and without inference-time retrieval overhead.

2. Do you need behavioral change or information access?

Fine-tuning changes how the model behaves. RAG changes what the model knows at inference time. If your problem is that the model doesn't respond in the right format, doesn't follow your internal conventions, or doesn't handle domain-specific reasoning correctly -- fine-tuning addresses the root cause. If the problem is that the model doesn't know about your company's products or your internal documentation -- RAG addresses that.

3. How large is your knowledge base, and how often do you need it?

A 10,000-document knowledge base is reasonable for RAG. You won't fine-tune 10,000 documents into a model effectively (the training data needs to be structured as examples, not raw documents). Conversely, if your knowledge base has 50 key concepts and every query needs most of them, you might be better off fine-tuning them in than paying retrieval cost on every request.

4. What are your latency requirements?

RAG adds at least one round-trip to your inference pipeline: the embedding call, the vector search, and the time to assemble the augmented prompt. On fast infrastructure this is 20-100ms. If you're building a voice assistant with a 500ms total budget, that's 4-20% of your latency budget before you've made the first LLM call. Fine-tuned models don't have this overhead -- the knowledge is in the weights.
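One way to make the four questions concrete is to encode them as a small function. This is a rough sketch under stated assumptions, not a definitive rule: the `UseCase` fields, the `recommend` logic, and the `SMALL_KB_DOCS` cutoff of 50 are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

SMALL_KB_DOCS = 50  # assumed cutoff for "small and stable enough to fine-tune in"


@dataclass
class UseCase:
    knowledge_changes_often: bool    # question 1
    needs_behavior_change: bool      # question 2: format, conventions, reasoning
    needs_information_access: bool   # question 2: product or documentation knowledge
    knowledge_base_docs: int         # question 3


def recommend(uc: UseCase) -> str:
    """Return 'rag', 'fine-tuning', or 'hybrid' for a use case."""
    small_stable_kb = (not uc.knowledge_changes_often
                       and uc.knowledge_base_docs <= SMALL_KB_DOCS)
    wants_rag = uc.needs_information_access and not small_stable_kb
    wants_fine_tuning = uc.needs_behavior_change or (
        uc.needs_information_access and small_stable_kb)
    if wants_rag and wants_fine_tuning:
        return "hybrid"
    if wants_fine_tuning:
        return "fine-tuning"
    return "rag"  # also the fallback when neither signal is clear


def retrieval_share(retrieval_ms: float, budget_ms: float) -> float:
    """Question 4: fraction of the end-to-end latency budget spent on retrieval."""
    return retrieval_ms / budget_ms


support_bot = UseCase(knowledge_changes_often=True, needs_behavior_change=True,
                      needs_information_access=True, knowledge_base_docs=10_000)
choice = recommend(support_bot)
share = retrieval_share(retrieval_ms=100, budget_ms=500)
```

Latency is kept as a separate number rather than folded into `recommend`, since how much retrieval overhead is acceptable depends on the product.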

Cost Comparison

The costs look different depending on which direction you're optimizing for.

Fine-Tuning Costs

Fine-tuning has two cost centers: training and serving.

Training cost depends on model size, dataset size, and number of epochs. A rough estimate for LoRA fine-tuning on an 8B model with 10K examples:

Model:     Llama 3.1 8B
Dataset:   10,000 examples (~500 tokens each = ~5M tokens)
Method:    LoRA (rank 16, alpha 32)
Hardware:  1x H100
Time:      ~2-4 hours
Cost:      ~$10-20 at current H100 spot rates

Full fine-tuning is 4-10x more expensive and usually not worth it when LoRA gets you most of the way there. QLoRA reduces VRAM requirements by quantizing the frozen base weights to 4-bit, at the cost of some extra compute to dequantize them during training.
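Part of why LoRA is cheaper is that it trains only a small set of added weights. For each adapted weight matrix of shape d_in x d_out, LoRA adds two low-rank matrices with rank * (d_in + d_out) parameters in total. The sketch below counts them. The layer shapes, the layer count, and the choice to adapt only the query and value projections are assumptions for a Llama-style 8B model, not settings taken from the estimate above; the alpha value does not appear because it scales the update and adds no parameters.

```python
def lora_trainable_params(rank: int, layer_shapes: list[tuple[int, int]], num_layers: int) -> int:
    """Count LoRA parameters: each adapted d_in x d_out matrix gains
    A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


# Assumed shapes: query projection 4096 -> 4096, value projection 4096 -> 1024.
ASSUMED_SHAPES = [(4096, 4096), (4096, 1024)]
ASSUMED_LAYERS = 32

trainable = lora_trainable_params(rank=16, layer_shapes=ASSUMED_SHAPES, num_layers=ASSUMED_LAYERS)
fraction = trainable / 8_000_000_000  # share of an 8B-parameter model

print(f"{trainable:,} trainable parameters ({fraction:.4%} of the model)")
```

Under these assumptions the adapter is well under 0.1% of the model's parameters, so optimizer state and gradients for the trained weights are a small fraction of what full fine-tuning needs.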

Serving cost is where fine-tuning gets interesting. A fine-tuned model doesn't require a longer context window than the base model. If your alternative was a 2,000-token system prompt that you've now replaced with fine-tuned behavior, you're saving those tokens on every request. At scale (millions of requests per month), that adds up.

The re-training cost is the recurring expense. If you need to update behavior quarterly, factor in $20-100 per training run depending on model size and dataset complexity.

RAG Costs

RAG costs have three components: embedding, storage, and retrieval.

Embedding your corpus is a one-time cost per ingestion run. A 10,000-document corpus at 500 tokens per document is 5M tokens. At typical embedding model prices, this is a few dollars per ingestion run.

Vector storage is cheap. Pinecone, Weaviate, Qdrant, and open-source alternatives like pgvector can store millions of vectors for a few dollars per month.

The real cost in RAG is the added context. If retrieval returns 1,500 tokens of context per query, that's 1,500 tokens added to every prompt. On a high-traffic application running 10 million queries per month, you're adding 15 billion input tokens per month. At input prices of roughly $0.01-0.05 per million tokens, that's $150-750 per month depending on which model you're using.
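The arithmetic behind that estimate is worth keeping in a small helper so you can plug in your own numbers. In this sketch the per-million-token prices of $0.01 and $0.05 are assumptions, chosen only because they reproduce the $150-750 range; substitute your provider's actual input price.

```python
def token_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of processing `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


context_tokens_per_query = 1_500
queries_per_month = 10_000_000
added_tokens = context_tokens_per_query * queries_per_month  # 15 billion

# Assumed input prices in USD per million tokens.
low = token_cost_usd(added_tokens, usd_per_million=0.01)
high = token_cost_usd(added_tokens, usd_per_million=0.05)

print(f"{added_tokens:,} added input tokens per month")
print(f"${low:,.0f} to ${high:,.0f} per month")
```

The same function covers corpus embedding: pass the corpus size in tokens and the embedding model's price instead.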

| Approach | Upfront | Recurring | Per-request overhead |
|---|---|---|---|
| Fine-tuning | $20-500 per training run | Re-training when knowledge changes | None (lower latency) |
| RAG | Embedding corpus (~$2-10) | Vector storage (~$5-50/month) | +500-2000 input tokens per query |
| Hybrid | Both upfront costs | Both recurring costs | RAG overhead only on retrieval path |

At low traffic (under 100K requests/month), the cost difference is usually negligible. At high traffic, fine-tuning tends to win on per-request cost if the knowledge is stable enough that retraining is infrequent.

When to Use Both

The hybrid approach uses fine-tuning to establish baseline behavior and RAG to provide factual grounding at inference time.

A concrete example: a customer support bot. You fine-tune the model to respond in your brand voice, follow your escalation policies, and format responses correctly. You use RAG to pull in the specific product documentation, knowledge base articles, or account information relevant to each query. The fine-tuning handles the stable behavior; the RAG handles the dynamic content.
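In code, the split looks roughly like the sketch below. It assumes you already have a retrieval function and a client for your fine-tuned model; both are passed in as callables, and the names `answer`, `retrieve`, and `generate` are hypothetical. Note what the prompt leaves out: there are no voice or formatting instructions, because the fine-tuned model is expected to carry that behavior.

```python
from typing import Callable


def answer(
    query: str,
    retrieve: Callable[[str, int], list[str]],  # RAG side: dynamic content
    generate: Callable[[str], str],             # fine-tuned model: stable behavior
    k: int = 3,
) -> str:
    """Fetch per-query context, then let the fine-tuned model write the reply."""
    chunks = retrieve(query, k)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nCustomer question: {query}"
    return generate(prompt)


# Stubs standing in for a real vector store and a real model endpoint.
def fake_retrieve(query: str, k: int) -> list[str]:
    return ["The Pro plan includes priority support."][:k]


def fake_generate(prompt: str) -> str:
    return f"(model reply based on {len(prompt)} prompt characters)"


reply = answer("Does Pro include priority support?", fake_retrieve, fake_generate)
```

Keeping the two sides behind separate callables also makes it easier to evaluate retrieval and generation independently.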

This works well when you have two distinct problems. If you only have one, start with the simpler approach.

The operational overhead of a hybrid system is real. You're maintaining a training pipeline, a vector store, an embedding pipeline, and the orchestration layer that ties them together. Before committing to both, make sure the performance improvement justifies the complexity.

Common Mistakes

Fine-tuning on examples that contain the facts. Fine-tuning doesn't reliably encode specific factual content for later recall. If your examples contain product SKUs, prices, or other structured facts, the model may learn surface patterns but won't reliably reproduce those facts at inference time. Use RAG for facts.

Using RAG to fix a formatting problem. If the issue is that the model keeps writing verbose responses when you want concise ones, putting formatting instructions in retrieved chunks is the wrong fix. Fine-tune or use a system prompt.

Under-investing in retrieval quality. RAG is only as good as your retrieval. A lot of teams spend time on the generation side and treat retrieval as a solved problem. Chunking strategy, embedding model choice, and query expansion matter a lot. Evaluate retrieval quality independently (precision@K, recall@K) before blaming the generation step.
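Evaluating retrieval on its own takes very little code. The sketch below computes precision@K and recall@K for one query, assuming you have a ranked list of retrieved chunk IDs and a hand-labeled set of relevant IDs; the IDs and function names here are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-K retrieved items that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top K."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical labeled example: ranked retriever output vs. ground truth.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2", "d9"}

p3 = precision_at_k(retrieved_ids, relevant_ids, k=3)
r3 = recall_at_k(retrieved_ids, relevant_ids, k=3)
```

Averaging these over a few dozen labeled queries gives a baseline you can compare against when you change chunking or swap embedding models.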

Retraining too infrequently. If your use case requires accuracy on facts or policies that change over time and you're using fine-tuning, you need a clear schedule for retraining. A stale fine-tuned model can confidently give outdated information.

Making the Decision

If you're still uncertain after working through the framework above, start with RAG. The iteration cycle is faster (update your corpus, no re-training), the infrastructure is well-understood, and you can switch to fine-tuning later once you've characterized what kind of errors your system is making.

If you're seeing errors that look like behavioral issues (wrong format, wrong reasoning pattern, off-brand responses) rather than factual gaps, add fine-tuning to the mix.

For most production applications, the answer ends up being some combination over time. Start simple, measure what's failing, and add complexity when you have a clear reason to.

General Compute's API supports both approaches: fast inference for fine-tuned open-source models, plus the low latency and per-token pricing that keep RAG pipelines economical at scale. If you want to run your fine-tuned Llama or Qwen model, or need high-throughput inference for a RAG-heavy application, check out the GeneralCompute API and get started in a few lines of code.