Retrieval-Augmented Generation (RAG)

Patterns for grounding an LLM in your own documents (or any external corpus) rather than relying on the model’s training-time knowledge. RAG addresses hallucination, staleness, and lack of private knowledge in a single architecture.

The term was coined by Facebook AI Research (Lewis et al., 2020). RAG is now the default integration pattern for production LLM systems that need verifiable, citable answers.

The Core Loop

User Query
   │
   ▼
[1] Embed query → vector
   │
   ▼
[2] Search vector store / keyword index for top-K relevant chunks
   │
   ▼
[3] Stuff chunks into LLM prompt as CONTEXT
   │
   ▼
[4] LLM generates answer constrained to the context
   │
   ▼
[5] (Optional) cite chunks; return with sources

The whole thing fits in ~50 lines of code if you skip the vector DB and use a simple dot-product over numpy arrays.
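
As a rough illustration of that claim, here is a minimal sketch of steps 1 to 3 of the loop plus the prompt for step 4. It makes several assumptions: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, the sample chunks are invented, and the actual LLM call is left out, so the sketch stops at the prompt you would send.

```python
import hashlib
import numpy as np

DIM = 256  # size of the toy embedding space (arbitrary choice)

def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")
        if token:
            # md5 keeps the bucket assignment stable across runs
            idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_index(chunks: list[str]) -> np.ndarray:
    """One row per chunk; this matrix plays the role of the vector store."""
    return np.stack([embed(c) for c in chunks])

def retrieve(query: str, index: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Steps 1-2: embed the query, score every chunk by dot product, keep top-k."""
    scores = index @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

def build_prompt(query: str, chunks: list[str], hits: list[tuple[int, float]]) -> str:
    """Step 3: stuff the retrieved chunks into the prompt, numbered for citation."""
    context = "\n".join(f"[{i}] {chunks[i]}" for i, _ in hits)
    return (
        "Answer using ONLY the context below. Cite sources as [n].\n"
        "If the context is insufficient, say so.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {query}\n"
    )

# Invented sample corpus, for illustration only.
chunks = [
    "The refund window is 30 days from the delivery date.",
    "Support is available by email on weekdays.",
    "Shipping to Canada takes five to seven business days.",
]
index = build_index(chunks)
query = "How long is the refund window?"
hits = retrieve(query, index, k=2)
prompt = build_prompt(query, chunks, hits)  # step 4 would send this to the LLM
```

Because the vectors are normalised, the dot product here is cosine similarity. Swapping `embed` for a real model and `index @ ...` for a vector store changes the quality and scale, not the shape of the loop.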

Why It Beats Fine-Tuning for Most Cases

Concern        | Fine-tune                       | RAG
---------------|---------------------------------|-------------------------------
Fresh data     | Re-train (hours-days)           | Add to index (seconds)
Cost           | $$$ per training run            | $ per query
Provenance     | Mixed into weights, opaque      | Citable to specific chunks
Hallucination  | Trained-in is hard to remove    | Tunable via prompt constraints
Multi-tenant   | One model per tenant if private | One model, per-tenant retrieval
Reasoning gain | Real                            | None (knowledge only)

Fine-tune for new SKILLS or BEHAVIORS. RAG for new KNOWLEDGE.

The Quality Failure Modes

Most failed RAG systems fail at retrieval, not generation:

  1. Bad chunking — semantic units split mid-sentence; tables broken across chunks; markdown headers stripped
  2. Bad embeddings — using a model not aligned to your domain (legal corpus retrieved with general-web embeddings)
  3. No reranking — top-K cosine similarity ≠ top-K relevance; need a cross-encoder pass
  4. Single-pass retrieval — query goes in once. Good systems decompose, retrieve multiple times, synthesize
  5. No metadata filtering — retrieving from “all of corp wiki” when you should pre-filter to “engineering pages from 2025”
  6. Stale index — corpus updated, embeddings not refreshed. Citations become misleading.
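
Two of the failure modes above (bad chunking and missing metadata filtering) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production chunker: it assumes markdown-style `#` headings separated from body text by blank lines, and the `Chunk` class, `chunk_markdown`, `prefilter`, and the `team`/`year` metadata keys are hypothetical names made up for this example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    heading: str  # nearest heading above the chunk, kept instead of stripped
    meta: dict    # hypothetical metadata, e.g. {"team": "engineering", "year": 2025}

def chunk_markdown(doc: str, meta: dict, max_chars: int = 400) -> list[Chunk]:
    """Split on blank lines only, so no paragraph (or table) is cut in the middle."""
    chunks: list[Chunk] = []
    heading = ""
    buf: list[str] = []

    def flush() -> None:
        if buf:
            chunks.append(Chunk("\n\n".join(buf), heading, dict(meta)))
            buf.clear()

    for para in (p.strip() for p in doc.split("\n\n")):
        if not para:
            continue
        if para.startswith("#"):
            flush()  # a new section closes the previous chunk
            heading = para.lstrip("#").strip()
            continue
        if buf and sum(len(p) for p in buf) + len(para) > max_chars:
            flush()  # size limit reached: start a new chunk at a paragraph boundary
        buf.append(para)
    flush()
    return chunks

def prefilter(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [c for c in chunks if all(c.meta.get(k) == v for k, v in required.items())]

# Invented sample documents, for illustration only.
doc = (
    "# Deploys\n\nDeploys run from the main branch.\n\n"
    "Rollbacks use the previous tag.\n\n"
    "# On-call\n\nThe rotation changes weekly."
)
eng = chunk_markdown(doc, {"team": "engineering", "year": 2025})
hr = chunk_markdown("# Leave\n\nLeave requests go through the portal.",
                    {"team": "hr", "year": 2024})
candidates = prefilter(eng + hr, team="engineering", year=2025)
```

The filter runs before any similarity search, so the embedding comparison only ever sees chunks that could plausibly be relevant.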

Where RAG Doesn’t Help

  • Math + computation — LLM still does the reasoning. Wrong-context retrieval can make math worse, not better. Use code-interpreter or function-calling.
  • Highly compositional reasoning — multi-hop questions where you need to combine 5 facts. Naive RAG retrieves locally; misses connections. Solutions: GraphRAG, agentic loops.
  • Personalized voice/style — RAG gives knowledge, not voice. For voice, you fine-tune or in-context-learn from examples (few-shot).
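
For the math case, a function-calling setup might look roughly like the sketch below: the model emits a structured tool call and ordinary code does the arithmetic. The call format (`{"tool": ..., "arguments": ...}`), the `TOOLS` registry, and the `calculate` and `run_tool_call` names are assumptions for illustration and do not match any particular vendor's API.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str) -> float:
    """Evaluate plain arithmetic by walking the parsed syntax tree (no eval)."""
    def ev(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

TOOLS = {"calculate": calculate}  # hypothetical tool registry

def run_tool_call(call: dict):
    """Dispatch a tool call of the assumed shape to the matching function."""
    return TOOLS[call["tool"]](**call["arguments"])

# A tool call as the model might emit it (format is an assumption).
result = run_tool_call(
    {"tool": "calculate", "arguments": {"expression": "(1200 - 200) * 3 / 4"}}
)
```

The point of the design is that retrieval supplies the numbers and the tool does the computation, so the model never has to do arithmetic in its head.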

The Wiki Pattern as a RAG Variant

The Karpathy LLM wiki pattern (this entire vault) is “RAG with extreme curation”: instead of dumping raw documents into a vector index, an LLM agent pre-processes sources into structured wiki pages with summaries + tags + cross-links. The “retrieval” becomes filesystem traversal driven by the wiki’s own index.md. Faster, more accurate, no embedding model needed, but requires the maintenance loop.

Trade-off: vector RAG scales to terabytes of unstructured data; wiki-RAG scales to roughly thousands of curated pages. Different sweet spots.
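
The index-driven traversal could be sketched as follows. Everything specific here is an assumption: the `index.md` entry format (`- [Title](path.md): summary`), the function names, and the sample vault are hypothetical, and the keyword-overlap scoring is only a stand-in for the LLM agent that would normally read the index and choose which pages to open.

```python
import re
import tempfile
from pathlib import Path

# Assumed index.md entry format: "- [Title](relative/path.md): one-line summary"
ENTRY = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<path>[^)]+)\)\s*[:-]?\s*(?P<summary>.*)$")

def parse_index(index_text: str) -> list[dict]:
    """Turn index.md lines into {title, path, summary} records."""
    entries = []
    for line in index_text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            entries.append(m.groupdict())
    return entries

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_pages(query: str, entries: list[dict], k: int = 2) -> list[dict]:
    """Stand-in for the agent's choice: rank pages by word overlap with the query."""
    q = words(query)
    scored = [(len(q & words(e["title"] + " " + e["summary"])), e) for e in entries]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [e for _, e in scored[:k]]

def load_context(vault: Path, query: str, k: int = 2) -> str:
    """Read index.md, pick pages, and return their text with source paths."""
    entries = parse_index((vault / "index.md").read_text(encoding="utf-8"))
    pages = select_pages(query, entries, k)
    return "\n\n".join(
        f"## {p['title']} ({p['path']})\n" + (vault / p["path"]).read_text(encoding="utf-8")
        for p in pages
    )

# Invented three-file vault, built in a temp directory for illustration.
with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "index.md").write_text(
        "- [RAG](rag.md): retrieval augmented generation patterns\n"
        "- [Fine-tuning](fine-tuning.md): adapting model weights for new behaviors\n",
        encoding="utf-8",
    )
    (vault / "rag.md").write_text("RAG grounds answers in retrieved chunks.", encoding="utf-8")
    (vault / "fine-tuning.md").write_text("Fine-tuning changes model weights.", encoding="utf-8")
    context = load_context(vault, "How does retrieval work in RAG?", k=1)
```

No embedding model appears anywhere in the path from query to context, which is the property the wiki pattern trades curation effort for.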

See Also