AI & ML

Fine-tuning vs RAG: We Tried Both. Here's When Each Actually Makes Sense

Zyptr Admin
22 January 2024
8 min read

The False Binary

Every other blog post frames this as fine-tuning OR RAG. That's not how it works in practice. We've shipped projects that use both, projects that use neither (plain prompt engineering with GPT-4 is underrated), and projects where we started with one and switched to the other mid-way. The right answer depends on your data, your latency requirements, and frankly, your budget.

Here's what we've actually learned from building six production systems: three with RAG, two with fine-tuning, and one hybrid.

When RAG Is the Clear Winner

RAG wins when your knowledge base changes frequently. We built a customer support system for an e-commerce client where product catalogs, policies, and FAQ answers change weekly. Fine-tuning would mean retraining every week — expensive and slow. With RAG, we just re-index the documents and the system picks up changes within minutes.

RAG also wins when you need citations and traceability. For a legal-tech client, being able to point to the exact clause in a contract that informed the AI's answer was a hard requirement. RAG gives you that for free — the retrieved chunks are your citations. Try getting that from a fine-tuned model. You can't, at least not reliably.

Our typical RAG stack: OpenAI text-embedding-3-small for embeddings (it's cheap and surprisingly good), pgvector on Supabase for the vector store (we stopped using Pinecone for most projects — more on that later), and GPT-4o-mini for generation. Total cost for a mid-traffic app: about $150-300/month.
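To make the retrieval step concrete, here's a minimal sketch of the top-k lookup that pgvector performs for us server-side: plain cosine similarity over precomputed embedding vectors. The function names and the in-memory document list are illustrative, not our production code.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=3):
    # docs: list of (doc_id, embedding) pairs; returns the k closest doc ids
    scored = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

In production the sort happens inside Postgres (pgvector's `<=>` cosine-distance operator in an `ORDER BY`), so you never pull all vectors into Python; the math is the same.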

When Fine-tuning Makes More Sense

Fine-tuning wins when you need the model to adopt a very specific tone, format, or reasoning pattern that's hard to capture in a prompt. We fine-tuned GPT-3.5-turbo for a client who needed medical report summaries in a very specific format — structured sections, specific terminology, a particular level of detail. Prompt engineering got us to 70% accuracy on format compliance. Fine-tuning on 500 examples got us to 96%.
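For reference, OpenAI's fine-tuning API takes training data as JSONL in chat format, one example per line. A sketch of how examples like our 500 get assembled (the helper name and field contents are illustrative, not our actual training data):

```python
import json

def to_training_line(system_instruction, report_text, target_summary):
    # One JSONL line in OpenAI's chat fine-tuning format:
    # system prompt, user input, and the ideal assistant output
    record = {
        "messages": [
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": report_text},
            {"role": "assistant", "content": target_summary},
        ]
    }
    return json.dumps(record)
```

Write one line per example to a `.jsonl` file, upload it via the Files API, and start a fine-tuning job against it.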

Fine-tuning also wins for latency-sensitive applications. RAG adds a retrieval step — typically 100-300ms for the embedding + vector search. If you're building something that needs sub-200ms total response time, that retrieval tax might be unacceptable. A fine-tuned model that has the knowledge baked in skips that step entirely.

The cost math is different from what most people assume. Fine-tuning GPT-3.5-turbo costs roughly $8 per 1M training tokens. But once trained, inference is only slightly more expensive than the base model. If you're making millions of calls, fine-tuning can actually be cheaper than RAG (where you're paying for embedding generation and vector DB hosting on top of the LLM call).
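The break-even is easy to sanity-check yourself. A back-of-the-envelope comparison, with every per-call price a placeholder you'd fill in from the current pricing pages:

```python
def monthly_cost_rag(calls, llm_per_call, embed_per_call, vector_db_monthly):
    # Per-call LLM + embedding cost, plus flat vector-DB hosting
    return calls * (llm_per_call + embed_per_call) + vector_db_monthly

def monthly_cost_finetuned(calls, ft_per_call, training_amortized_monthly):
    # Slightly pricier inference, no retrieval costs, training amortized
    return calls * ft_per_call + training_amortized_monthly
```

Plot both against call volume for your real prices and the crossover point falls out; below it RAG is cheaper, above it the fine-tuned model usually wins.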

The Hybrid Approach We're Using More Often

Our latest approach for complex projects: fine-tune a model on your domain's style and reasoning patterns, then use RAG to inject current factual information. The fine-tuned model "knows" how to reason about your domain, and RAG provides the specific facts it needs. We used this for a financial analysis tool — the fine-tuned model understood how to interpret financial metrics, and RAG provided the actual company data. Best of both worlds, though it does add complexity to your pipeline.
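The glue between the two halves is simple: the retrieved facts get injected into the prompt of the fine-tuned model. A sketch (the message wording is illustrative):

```python
def build_hybrid_prompt(retrieved_chunks, question):
    # Fine-tuned model supplies the domain reasoning and output format;
    # RAG supplies the current facts via the context block
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": "Answer using only the facts in the context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```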

The Decision Framework

Here's the honest framework we use: If your data changes more than monthly, start with RAG. If you need a very specific output format or tone, fine-tune. If you need both dynamic data and specific behavior, go hybrid but make sure you have the engineering bandwidth to maintain it. And if your dataset is under 1000 documents and doesn't change much — honestly, just stuff it into the system prompt. We've seen teams over-engineer RAG systems for problems that a well-crafted prompt with 20 examples could solve.
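The framework above, written out as a hypothetical helper. The thresholds are our rough rules of thumb, not hard laws:

```python
def choose_approach(changes_per_month, needs_strict_format, doc_count):
    # Small, stable corpus: just put it in the system prompt
    if doc_count < 1000 and changes_per_month <= 1:
        return "prompt engineering"
    # Dynamic data plus strict output requirements: both techniques
    if changes_per_month > 1 and needs_strict_format:
        return "hybrid"
    # Data changes more than monthly: retrieval keeps it fresh
    if changes_per_month > 1:
        return "rag"
    # Stable data but a specific format or tone: bake it in
    if needs_strict_format:
        return "fine-tune"
    return "prompt engineering"
```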

Tags: rag, fine-tuning, llm, ai-architecture