Retrieval-Augmented Generation (RAG)

Patterns for grounding an LLM in your own documents (or any external corpus) rather than relying on the model’s training-time knowledge. RAG mitigates hallucination, staleness, and lack of private knowledge within a single architecture.

Term coined by Lewis et al. (2020) at Facebook AI Research. Now the default integration pattern for production LLM systems that need verifiable, citable answers.

The Core Loop

User Query
   │
   ▼
[1] Embed query → vector
   │
   ▼
[2] Search vector store / keyword index for top-K relevant chunks
   │
   ▼
[3] Stuff chunks into LLM prompt as CONTEXT
   │
   ▼
[4] LLM generates answer constrained to the context
   │
   ▼
[5] (Optional) cite chunks; return with sources

The whole thing fits in ~50 lines of code if you skip the vector DB and use a simple dot-product over numpy arrays.
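As a rough illustration of that claim, here is a minimal sketch of steps 1 to 3 of the loop using only numpy. The `embed` function is a toy hashed bag-of-words stand-in, not a real embedding model, and the names `build_index`, `retrieve`, `build_prompt`, and the sample chunks are all assumptions made for this example. Step 4 (calling the LLM) is left out because it depends on whichever client you use.

```python
import hashlib
import re

import numpy as np

DIM = 1024  # arbitrary size for the toy embedding space


def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(DIM)
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 keeps bucket assignment stable across runs (built-in hash() is salted)
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


def build_index(chunks: list[str]) -> np.ndarray:
    """Stack one embedding per chunk into a (num_chunks, DIM) matrix."""
    return np.stack([embed(c) for c in chunks])


def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 2):
    """Return the k chunks with the highest dot-product score against the query."""
    scores = index @ embed(query)  # equals cosine similarity, since rows are unit-length
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]


def build_prompt(query: str, retrieved) -> str:
    """Place the retrieved chunks into the prompt as numbered, citable context."""
    context = "\n".join(f"[{n}] {text}" for n, (text, _) in enumerate(retrieved, 1))
    return (
        "Answer using only the CONTEXT below and cite sources as [n].\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {query}\nANSWER:"
    )


chunks = [
    "The deploy pipeline runs integration tests before every release.",
    "Vacation requests must be filed two weeks in advance.",
    "Release notes are published on the engineering wiki after each deploy.",
]
index = build_index(chunks)
query = "When must vacation requests be filed?"
hits = retrieve(query, chunks, index, k=2)
prompt = build_prompt(query, hits)
print(prompt)  # pass this string to whichever LLM client you use
```

Swapping `embed` for a real embedding model is the only change needed to make retrieval semantic rather than lexical; the index, scoring, and prompt assembly stay the same.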

Why It Beats Fine-Tuning for Most Cases

Concern	Fine-tune	RAG
Fresh data	Re-train (hours-days)	Add to index (seconds)
Cost	$$$ per training run	$ per query
Provenance	Mixed into weights, opaque	Citable to specific chunks
Hallucination	Trained-in errors are hard to remove	Tunable via prompt constraints
Multi-tenant	One model per tenant if private	One model, per-tenant retrieval
Reasoning gain	Real	None (knowledge only)

Fine-tune for new SKILLS or BEHAVIORS. RAG for new KNOWLEDGE.

The Quality Failure Modes

Most failed RAG systems fail at retrieval, not generation:

Bad chunking — semantic units split mid-sentence; tables broken across chunks; markdown headers stripped
Bad embeddings — using a model not aligned to your domain (legal corpus retrieved with general-web embeddings)
No reranking — top-K cosine similarity ≠ top-K relevance; need a cross-encoder pass
Single-pass retrieval — query goes in once. Good systems decompose, retrieve multiple times, synthesize
No metadata filtering — retrieving from “all of corp wiki” when you should pre-filter to “engineering pages from 2025”
Stale index — corpus updated, embeddings not refreshed. Citations become misleading.
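Two of the failure modes above, bad chunking and missing metadata filtering, are cheap to guard against. The sketch below is one possible approach, not a standard API: `Chunk`, `chunk_section`, and `prefilter` are hypothetical names, the `team` and `year` fields are assumed metadata, and the regex sentence splitter is deliberately naive (it will mis-split on abbreviations such as "e.g.").

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str      # heading + sentences, ready to embed
    heading: str   # kept so the chunk carries its own context
    team: str      # assumed metadata field
    year: int      # assumed metadata field


def chunk_section(heading: str, body: str, team: str, year: int, max_chars: int = 200):
    """Pack whole sentences into chunks so no chunk ends mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(Chunk(f"{heading}\n{current}", heading, team, year))
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(Chunk(f"{heading}\n{current}", heading, team, year))
    return chunks


def prefilter(chunks, team: str, min_year: int):
    """Narrow the candidate set by metadata before any similarity search."""
    return [c for c in chunks if c.team == team and c.year >= min_year]


docs = chunk_section(
    "Deploy Runbook",
    "Deploys run nightly. Rollbacks need approval. Alerts page the on-call engineer.",
    team="engineering", year=2025, max_chars=50,
) + chunk_section("Travel Policy", "Book flights early.", team="hr", year=2023)

kept = prefilter(docs, team="engineering", min_year=2025)
for chunk in kept:
    print(repr(chunk.text))
```

Prefixing each chunk with its heading is one way to avoid the "headers stripped" problem: the embedding then sees which section a sentence came from.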

Where RAG Doesn’t Help

Math + computation — LLM still does the reasoning. Wrong-context retrieval can make math worse, not better. Use code-interpreter or function-calling.
Highly compositional reasoning — multi-hop questions where you need to combine 5 facts. Naive RAG retrieves locally; misses connections. Solutions: GraphRAG, agentic loops.
Personalized voice/style — RAG gives knowledge, not voice. For voice, you fine-tune or in-context-learn from examples (few-shot).
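For the math case, the function-calling route can be sketched as a small calculator tool that the application runs on the model's behalf. This is an illustration under assumptions: the tool-call shape `{"name": ..., "arguments": {...}}` and the names `calculate` and `handle_tool_call` are hypothetical and not tied to any particular provider's API.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate an arithmetic expression by walking its AST, without eval()."""

    def walk(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


def handle_tool_call(call: dict) -> str:
    """Run a tool call requested by the model and return the result as text."""
    if call["name"] != "calculate":
        raise ValueError(f"unknown tool: {call['name']}")
    return str(calculate(call["arguments"]["expression"]))


# Hypothetical tool call, as the model might emit it instead of doing the math itself
result = handle_tool_call({"name": "calculate", "arguments": {"expression": "17 * 23"}})
print(result)  # 391
```

The model's job shrinks to deciding which expression to compute; the arithmetic itself is done deterministically outside the model.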

The Wiki Pattern as a RAG Variant

The Karpathy LLM wiki pattern (this entire vault) is “RAG with extreme curation”: instead of dumping raw documents into a vector index, an LLM agent pre-processes sources into structured wiki pages with summaries + tags + cross-links. The “retrieval” becomes filesystem traversal driven by the wiki’s own index.md. Faster, more accurate, no embedding model needed, but requires the maintenance loop.

Trade-off: vector RAG scales to TB of unstructured data. Wiki-RAG scales to ~thousands of curated pages. Different sweet spots.
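To make the index-driven retrieval concrete, here is a minimal sketch. The line format assumed for index.md (`- [[page]] | summary | tags: a, b`) is an invention for this example, not a documented convention of the pattern, and `load_index` and `retrieve_pages` are hypothetical names. Scoring is plain keyword overlap, with tag matches weighted double.

```python
import re
import tempfile
from pathlib import Path

# Assumed index.md line format: - [[page]] | summary | tags: a, b
ENTRY = re.compile(
    r"^- \[\[(?P<page>[^\]]+)\]\]\s*\|\s*(?P<summary>[^|]*)\|\s*tags:\s*(?P<tags>.*)$"
)


def load_index(root: Path) -> list[dict]:
    """Parse index.md into entries with a page name, summary, and tag set."""
    entries = []
    for line in (root / "index.md").read_text(encoding="utf-8").splitlines():
        m = ENTRY.match(line.strip())
        if m:
            tags = {t.strip().lower() for t in m["tags"].split(",") if t.strip()}
            entries.append({"page": m["page"], "summary": m["summary"].strip(), "tags": tags})
    return entries


def retrieve_pages(root: Path, query: str, k: int = 2):
    """Pick the k best-matching pages from the index, then read them from disk."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    scored = []
    for entry in load_index(root):
        summary_words = set(re.findall(r"[a-z0-9]+", entry["summary"].lower()))
        score = 2 * len(words & entry["tags"]) + len(words & summary_words)
        if score:
            scored.append((score, entry["page"]))
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, name as tiebreak
    return [(page, (root / f"{page}.md").read_text(encoding="utf-8")) for _, page in scored[:k]]


# Build a tiny throwaway vault to exercise the functions
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "index.md").write_text(
        "- [[rag]] | Grounding an LLM in external documents | tags: retrieval, llm\n"
        "- [[fine-tuning]] | Adapting model weights to new behaviors | tags: training, llm\n",
        encoding="utf-8",
    )
    (root / "rag.md").write_text("RAG page body", encoding="utf-8")
    (root / "fine-tuning.md").write_text("Fine-tuning page body", encoding="utf-8")
    results = retrieve_pages(root, "how does retrieval ground an llm", k=1)

print(results)
```

No embedding model appears anywhere: retrieval quality depends entirely on how well the summaries and tags in the index are maintained.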
