Retrieval-Augmented Generation (RAG)
A pattern for grounding an LLM in your own documents (or any external corpus) rather than relying on the model’s training-time knowledge. It mitigates hallucination, staleness, and lack of private knowledge in a single architecture.
Coined by Facebook AI Research (Lewis et al., 2020). Now the default integration pattern for production LLM systems that need verifiable, citable answers.
The Core Loop
User Query
│
▼
[1] Embed query → vector
│
▼
[2] Search vector store / keyword index for top-K relevant chunks
│
▼
[3] Stuff chunks into LLM prompt as CONTEXT
│
▼
[4] LLM generates answer constrained to the context
│
▼
[5] (Optional) cite chunks; return with sources
The whole thing fits in ~50 lines of code if you skip the vector DB and use a simple dot-product over numpy arrays.
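A minimal sketch of steps 1 to 3 of the loop, assuming a toy hashed bag-of-words `embed` function in place of a real embedding model. The function names, sample chunks, and prompt wording are all illustrative, not taken from any particular library. Step 4 (the LLM call) is left out because it depends on your provider.

```python
import hashlib
import numpy as np

DIM = 512  # size of the toy embedding space (arbitrary choice)

def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")
        if token:
            # md5 keeps the bucket assignment stable across runs
            idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query: str, chunk_vecs: np.ndarray, k: int = 2) -> list[int]:
    """Steps 1-2: embed the query, score every chunk by dot product, keep top-k."""
    scores = chunk_vecs @ embed(query)
    return [int(i) for i in np.argsort(-scores)[:k]]

def build_prompt(query: str, chunks: list[str], hits: list[int]) -> str:
    """Step 3: stuff the retrieved chunks into the prompt as CONTEXT."""
    context = "\n".join(f"[{i}] {chunks[i]}" for i in hits)
    return (
        "Answer using only the context below. Cite chunk ids in brackets.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {query}\nANSWER:"
    )

# Made-up corpus for illustration only
chunks = [
    "The vacation policy grants 25 days of paid leave per year.",
    "Expense reports must be filed within 30 days of purchase.",
    "The office kitchen is cleaned every Friday afternoon.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])  # the "index"

query = "How many days of paid leave do I get?"
hits = retrieve(query, chunk_vecs, k=2)
prompt = build_prompt(query, chunks, hits)
print(prompt)  # this string is what you would send to the LLM in step 4
```

Because the vectors are L2-normalized, the dot product here equals cosine similarity. Swapping `embed` for a real embedding model leaves the rest of the loop unchanged.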
Why It Beats Fine-Tuning for Most Cases
| Concern | Fine-tune | RAG |
|---|---|---|
| Fresh data | Re-train (hours-days) | Add to index (seconds) |
| Cost | $$$ per training run | $ per query |
| Provenance | Mixed into weights, opaque | Citable to specific chunks |
| Hallucination | Trained-in errors are hard to remove | Tunable via prompt constraints |
| Multi-tenant | One model per tenant if private | One model, per-tenant retrieval |
| Reasoning gain | Real | None (knowledge only) |
Fine-tune for new SKILLS or BEHAVIORS. RAG for new KNOWLEDGE.
The Quality Failure Modes
Most failed RAG systems fail at retrieval, not generation:
- Bad chunking — semantic units split mid-sentence; tables broken across chunks; markdown headers stripped
- Bad embeddings — using a model not aligned to your domain (legal corpus retrieved with general-web embeddings)
- No reranking — top-K cosine similarity ≠ top-K relevance; need a cross-encoder pass
- Single-pass retrieval — query goes in once. Good systems decompose, retrieve multiple times, synthesize
- No metadata filtering — retrieving from “all of corp wiki” when you should pre-filter to “engineering pages from 2025”
- Stale index — corpus updated, embeddings not refreshed. Citations become misleading.
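As one illustration of the chunking failure mode above, here is a rough sketch of a chunker that only splits at sentence boundaries and repeats a little context between chunks. The regex-based sentence splitter, the function name, and the default sizes are assumptions for the example. Real corpora (tables, markdown headers, abbreviations) need a more careful splitter.

```python
import re

def chunk_sentences(text: str, max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars.

    The last `overlap` sentences of a chunk are repeated at the start of the
    next one. A single sentence longer than max_chars is kept whole rather
    than cut mid-sentence.
    """
    # Naive split: whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "Alpha is first. Beta is second. Gamma is third. Delta is fourth."
for chunk in chunk_sentences(text, max_chars=35, overlap=1):
    print(chunk)
```

The overlap is a design choice: it costs some index size but lowers the chance that a fact and its referent land in different chunks.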
Where RAG Doesn’t Help
- Math + computation — LLM still does the reasoning. Wrong-context retrieval can make math worse, not better. Use code-interpreter or function-calling.
- Highly compositional reasoning — multi-hop questions where you need to combine 5 facts. Naive RAG retrieves locally; misses connections. Solutions: GraphRAG, agentic loops.
- Personalized voice/style — RAG gives knowledge, not voice. For voice, you fine-tune or in-context-learn from examples (few-shot).
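For the math case above, the function-calling route means the model emits an expression and deterministic code computes the result. A minimal sketch of such a tool, assuming a hypothetical `calculate` function that the LLM would be allowed to call. How the tool gets registered with a model is provider-specific and not shown.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str):
    """Evaluate an arithmetic expression without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant):
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)

print(calculate("(1200 - 200) * 3"))
```

Retrieved chunks can still supply the inputs (prices, dates, quantities); the arithmetic itself stays out of the model.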
The Wiki Pattern as a RAG Variant
The Karpathy LLM wiki pattern (this entire vault) is “RAG with extreme curation”: instead of dumping raw documents into a vector index, an LLM agent pre-processes sources into structured wiki pages with summaries + tags + cross-links. “Retrieval” becomes filesystem traversal driven by the wiki’s own index.md. For a small curated corpus this can be faster and more accurate than vector search, and it needs no embedding model, but it requires a maintenance loop to keep pages current.
Trade-off: vector RAG scales to TB of unstructured data. Wiki-RAG scales to ~thousands of curated pages. Different sweet spots.
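A sketch of what index-driven retrieval could look like, assuming a hypothetical index.md layout of `- page | tag, tag | summary` per line. The real index format of this vault is not specified here, and the sample entries, tags, and summaries below are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    page: str
    tags: set[str]
    summary: str

def parse_index(index_text: str) -> list[IndexEntry]:
    """Parse lines of the assumed form '- page | tag, tag | summary'."""
    entries = []
    for line in index_text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue
        parts = [p.strip() for p in line[2:].split("|")]
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        page, tags, summary = parts
        tag_set = {t.strip().lower() for t in tags.split(",") if t.strip()}
        entries.append(IndexEntry(page, tag_set, summary))
    return entries

def lookup(entries: list[IndexEntry], query_terms: set[str]) -> list[str]:
    """Rank pages by number of matching tags; no embeddings involved."""
    scored = [(len(e.tags & query_terms), e.page) for e in entries]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable: ties keep index order
    return [page for score, page in ranked if score > 0]

# Hypothetical index content
INDEX = """
- tech-rag | rag, retrieval, llm | Grounding an LLM in external documents
- person-karpathy | people, llm | Notes on Andrej Karpathy
- tech-neuromorphic-computing | hardware | Brain-inspired chips
"""

entries = parse_index(INDEX)
print(lookup(entries, {"retrieval", "llm"}))
```

An agent would then open the returned pages from disk and follow their cross-links, which is the traversal step the pattern relies on.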
See Also
- person-karpathy
- tech-neuromorphic-computing (hardware angle: retrieval cost dominates serving cost for many RAG workloads)