RAG Deep Dive Series: Core Components
Part 2: RAG Architecture 101 — How It Actually Works
In Part 1, we answered the "why": RAG beats fine-tuning for most knowledge-intensive applications. We showed you the business case, the decision framework, and a real-world example of RAG in action.
But we intentionally kept the technical details light. We gave you the 30-second version: "user asks a question, system finds relevant documents, LLM generates an answer." That's what happens, but not how it happens.
Here's the thing: you can't make informed decisions about chunking strategies, embedding models, or retrieval techniques if you don't understand the architecture they fit into. It's like trying to optimize individual car parts without knowing how an engine works.
So in this post, we're popping the hood.
You'll learn:
- The 5 core components every RAG system has and what job each one does
- The two-phase architecture that separates preparation from runtime
- A complete walkthrough from "user asks a question" to "system returns a cited answer"
- Why each piece exists and how they work together
- Where things can go wrong and what that looks like
Same promise as Post 1: we're starting from absolute zero. If you've never heard terms like vector database, embedding model, or chunking strategy, that's fine. We'll explain each concept using analogies and examples that make the technical accessible, with no math and only a handful of small, optional code sketches: just clear explanations of how the pieces fit together.
By the end of this post, you'll have the mental model you need to understand everything else in this series. Think of this as your foundation.
Let's build it.
Table of Contents
- The Big Picture: RAG as a System
- The Two Phases: Indexing vs. Query Time
- The 5 Core Components
- The Complete Flow: Question to Answer
- What Can Go Wrong (And Why)
- Key Takeaways
- What's Next
The Big Picture: RAG as a System
Let's start by refreshing your memory. Remember the open-book exam analogy we mentioned in Part 1? Let's revisit it, because it's the key to understanding everything else.
The Open-Book Exam Recap:
When a user asks a question in a RAG system, here's what happens under the hood:
- The RAG system searches for content in your documents that's related to the user's question
- It passes that relevant content to the LLM
- It tells the LLM: "Here's the user's question, and here's the related content. Answer the question using only this content."
That's it. That's the core idea.
A Simple Example
Let's say a user asks: "What's your refund policy?"
Here's what happens:
- The system searches the knowledge base and finds the chunk containing your refund policy
- It hands that chunk to the LLM along with the question
- The LLM answers using only that policy text, quoting the actual refund terms
But How Does It Actually Do That Search?
This is where most explanations stop. "It searches your documents and finds relevant content." Cool. But how?
- How does it know which documents are "relevant"?
- How does it understand that "refund policy" and "return policy" and "money back guarantee" all mean similar things?
- How does it search through thousands of documents in milliseconds?
- Where is all this content stored?
- What happens if the relevant information is split across multiple documents?
These aren't trivial questions. The answers involve embedding models, vector databases, semantic similarity, and retrieval algorithms. But here's the good news: each of these concepts is actually pretty straightforward once you understand what problem it's solving.
So let's break it down. We'll start with the two phases of a RAG system, then dive into the five components that make it work.
The Two Phases: Indexing vs. Query Time
Here's the first thing you need to understand about RAG systems: they work in two completely separate phases.
Think of it like a library:
Phase 1: Setting Up the Library (Indexing)
- You organize all the books
- You create a cataloging system
- You put books on shelves in a logical order
- You create an index so people can find things
Phase 2: Helping Someone Find a Book (Query Time)
- A person walks in and asks a question
- You use your catalog/index to find relevant books
- You hand them the books they need
- They read and get their answer
RAG works exactly the same way. There's a preparation phase (indexing) that happens once, and a query phase that happens every time someone asks a question.
Let's break down each phase.
Phase 1: Indexing (The Preparation Phase)
When it happens: Once, when you first set up your RAG system. Then again whenever you add or update documents.
What happens:
- Collect the documents from your knowledge base
- Split them into smaller chunks
- Convert each chunk into a vector with the embedding model
- Store the vectors (plus the original text and metadata) in a vector database
This is all preparation work. No user has asked a question yet. You're just organizing your knowledge base so it's ready to be searched.
How long does it take?
- For a few hundred documents: Minutes
- For thousands of documents: Hours
- For millions of documents: Could take days
But you only do this once or whenever documents change. It's not blocking users.
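To make the indexing phase concrete, here's a minimal sketch of what it can look like in Python. It assumes the OpenAI embeddings API and a local Chroma collection purely as examples, and load_documents / split_into_chunks are hypothetical helpers standing in for the loading and chunking steps we'll cover later in the series.

# Indexing sketch: run once at setup, and again whenever documents change.
from openai import OpenAI
import chromadb

openai_client = OpenAI()
collection = chromadb.Client().create_collection("knowledge_base")

def embed(texts):
    # One fixed embedding model for everything (documents AND queries).
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [item.embedding for item in response.data]

for doc in load_documents("docs/"):            # hypothetical loader
    chunks = split_into_chunks(doc.text)       # hypothetical chunker (see Post 3)
    collection.add(
        ids=[f"{doc.name}-{i}" for i in range(len(chunks))],
        documents=chunks,                      # store the raw chunk text...
        embeddings=embed(chunks),              # ...and its vector
        metadatas=[{"source": doc.name} for _ in chunks],
    )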
Phase 2: Query Time (The Runtime Phase)
When it happens: Every single time a user asks a question.
What happens:
- Convert the user's question into a vector with the same embedding model
- Search the vector database for the most similar chunks
- Assemble those chunks into a prompt
- Have the LLM generate a cited answer from that prompt
How fast is it?
- Embedding the query: ~50-100ms
- Vector search: ~50-200ms (even across millions of vectors!)
- LLM generation: 1-2 seconds
- Total: 2-3 seconds end-to-end
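Here's the matching query-time sketch, reusing the embed() helper and collection from the indexing sketch above. The chat model name is just an example, and the timing comments are the rough budgets listed above, not measurements.

# Query-time sketch: runs on every user question.
def answer(question: str) -> str:
    query_vector = embed([question])[0]                    # ~50-100ms
    results = collection.query(
        query_embeddings=[query_vector], n_results=5       # ~50-200ms
    )
    context = "\n\n".join(results["documents"][0])         # top-5 chunk texts
    completion = openai_client.chat.completions.create(    # ~1-2s
        model="gpt-4o",                                    # illustrative model name
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content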
Why the Separation Matters
Because the preparation work is SLOW, but you only need to do it once.
The Trade-off:
| Phase | Speed | Frequency |
|---|---|---|
| Indexing | Slow (hours) | Rare (only when docs change) |
| Query Time | Fast (seconds) | Constant (every user question) |
This is why RAG can handle real-time queries even with massive document collections. All the heavy lifting happened beforehand.
Real-World Implications
When you add a new document:
- ❌ Users don't wait for it to be indexed
- ✅ Indexing happens in the background
- ✅ Document becomes searchable once indexing completes (minutes to hours later)
When a user asks a question:
- ✅ They get an answer in seconds
- ✅ The system searches already-indexed content
- ✅ No heavy processing needed
This is fundamentally different from fine-tuning:
| Approach | Update Time | Query Time |
|---|---|---|
| Fine-tuning | Days (retrain entire model) | Fast |
| RAG | Minutes/Hours (reindex new docs) | Fast |
Both are fast at query time, but RAG can incorporate new information orders of magnitude faster.
The Big Picture
So now you understand the two-phase split:
- Indexing Phase (Offline): Prepare your documents so they're searchable
- Query Phase (Online): Use those prepared documents to answer questions
But we still haven't answered the "how" questions:
- How do you convert text to "searchable vectors"?
- What even is a vector database?
- How does the system know which chunks are "similar" to the question?
That's where the 5 core components come in. Each component handles a specific job in this pipeline.
Let's dive into each one.
The 5 Core Components
Every RAG system has five core components. Some implementations might add extra bells and whistles, but these five are universal. If you're missing one of these, you don't have a RAG system.
Here's the lineup:
- Knowledge Base – your source documents
- Embedding Model – turns text into vectors
- Vector Database – stores and searches those vectors
- Retriever – finds the relevant chunks for a question
- Generator (LLM) – writes the final answer from those chunks
Let's break down each component and understand what job it does.
Component 1: Knowledge Base
What it is: Your collection of source documents. PDFs, Word docs, web pages, databases, whatever information you want the RAG system to search through.
The job it does: Provides the raw material that everything else works with.
What Goes in a Knowledge Base?
Pretty much any text-based information:
- Documents: PDFs, Word files, Google Docs, Markdown files
- Structured data: FAQs, help articles, wikis, knowledge bases
- Policies & procedures: Employee handbooks, compliance docs, SOPs
- Technical content: API documentation, product manuals, troubleshooting guides
- Communications: Emails, Slack messages, meeting notes
- Databases: Product catalogs, customer records, transaction logs
The key principle: If a human could read it to find an answer, RAG can search it.
Quality In = Quality Out
Here's the brutal truth: your RAG system can only be as good as the documents you feed it.
This isn't like training an LLM where the model can learn patterns and generalize. With RAG, if the information isn't in your knowledge base, the system literally cannot answer the question. There's no creativity, no guessing, no filling in gaps.
Common quality issues that break RAG systems:
- Duplicate or near-duplicate documents
- Outdated versions sitting alongside current ones
- Contradictory policies across documents
- Garbled text from bad OCR or broken encoding
- Missing metadata (no source, date, or version information)
Real example:
Your knowledge base has:
- Version 1 of your refund policy (outdated): "30-day refunds"
- Version 2 of your refund policy (current): "14-day refunds"
When someone asks about refunds, the RAG system might retrieve BOTH chunks. The LLM then has to work with contradictory information. Best case: it gets confused. Worst case: it confidently tells a customer they have 30 days when they only have 14.
The fix? Clean your knowledge base. Deduplication and version control aren't optional; they're critical.
The Knowledge Base Checklist
Before you index documents, make sure they're:
- Deduplicated – No repeated content across files
- Up-to-date – Latest versions only, old versions archived or deleted
- Properly formatted – Consistent structure (headings, lists, tables)
- Clean text – No garbled characters, broken encoding, or OCR errors
- Tagged with metadata – Source, date, category, version, author
Pro tip: Don't assume your documents are clean just because they "look fine" to humans. Run some test queries and see what gets retrieved. You'll be surprised what breaks.
Component 2: Embedding Model
What it is: A specialized AI model that converts text into mathematical vectors (lists of numbers) that capture semantic meaning.
The job it does: Translates human language into a format computers can mathematically compare.
The Problem Embeddings Solve
Let's start with why we even need this.
Imagine you have these three documents in your knowledge base:
- "How to request vacation days"
- "Annual leave policy"
- "Office parking information"
A user asks: "How do I take time off?"
How does the computer know that:
- "time off" = "vacation days" = "annual leave" ✅
- "time off" ≠ "parking" ❌
Words are different. But meanings are similar.
Traditional keyword search can't do this. It looks for exact matches:
- "time off" → No match in any document ❌
- Zero results returned
Embeddings solve this by capturing semantic meaning, not just keywords.
What Are Embeddings?
An embedding is a list of numbers, typically hundreds of them, that represents the meaning of a piece of text.
Simple example (with made-up, shortened vectors):
- "How do I take time off?" → [0.21, 0.84, 0.13, ...]
- "How do I request vacation days?" → [0.19, 0.82, 0.15, ...]
- "Where can I park my car?" → [0.88, 0.07, 0.91, ...]
Notice anything? The first two questions have similar vectors (the numbers are close), even though they use completely different words. The third question has a different vector because it's asking about something else entirely.
This is the magic of embeddings: Similar meanings = similar vectors, regardless of the exact words used.
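If you want to see "similar meanings = similar vectors" as actual math, here's a tiny sketch reusing the made-up 3-number vectors from the example above. Real embeddings have hundreds of dimensions, but the comparison, cosine similarity, works exactly the same way.

import math

def cosine_similarity(a, b):
    # 1.0 = same direction (same meaning), values near 0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

time_off      = [0.21, 0.84, 0.13]   # "How do I take time off?"
vacation_days = [0.19, 0.82, 0.15]   # "How do I request vacation days?"
parking       = [0.88, 0.07, 0.91]   # "Where can I park my car?"

print(cosine_similarity(time_off, vacation_days))  # close to 1.0 -> similar meaning
print(cosine_similarity(time_off, parking))        # much lower -> unrelated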
How Embedding Models Work (Conceptually)
You don't need to understand the neural network architecture. Here's the concept:
Embedding models are trained on massive amounts of text to learn that certain words and phrases tend to appear in similar contexts:
- "refund," "return," "money back," "reimbursement" all appear in similar contexts → similar vectors
- "lunch," "dinner," "breakfast," "meal" appear together → similar vectors
- "refund" and "lunch" rarely appear together → different vectors
The model learns these patterns from millions of examples and compresses that knowledge into vector representations.
Instead of treating "car" as just the string "c-a-r", embeddings understand that "car" is connected to a whole cloud of related concepts: vehicle, automobile, driving, engine, road.
Think of it like a language-aware GPS:
- Instead of latitude/longitude, it maps words to a high-dimensional space
- Words with similar meanings end up close together
- Words with different meanings end up far apart
All of these related words ("refund," "return," "money back," "reimbursement") end up with similar vectors, while unrelated words ("lunch," "parking") live in completely different neighborhoods.
Why the Same Model for Everything?
Critical rule: You MUST use the same embedding model for both:
- Indexing your documents (preparation phase)
- Embedding user queries (query phase)
Why? Because different models create different "spaces."
Analogy:
- Model A uses latitude/longitude
- Model B uses street addresses
If you index with Model A but search with Model B, you're trying to compare coordinates to addresses. They're incompatible, even if they're describing the same location.
In practice:
- Index documents with text-embedding-3-small
- Embed queries with the same text-embedding-3-small
- Mixing models = broken retrieval
Popular Embedding Models
You don't need to pick one yet, but here are the common options:
| Model | Provider | Dimensions | Best For |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | General purpose, good baseline |
| text-embedding-3-large | OpenAI | 3072 | Higher accuracy needs |
| text-embedding-004 | Google | 768 | Multilingual support |
| embed-v3 | Cohere | 1024 | Multilingual, compression |
| all-MiniLM-L6-v2 | Open Source | 384 | Fast, lightweight, self-hosted |
Key Properties of Good Embeddings
| Property | What It Means | Why It Matters |
|---|---|---|
| Dimensionality | Length of the vector (e.g., 384, 768, 1536) | Higher = more nuanced but slower |
| Semantic Clustering | Similar meanings → nearby vectors | Enables accurate retrieval |
| Domain Alignment | Trained on similar content | General models may miss domain jargon |
What are Dimensions? That's how many numbers are in each vector. More dimensions = more nuance, but slower and more expensive.
For now, just know the embedding model is the translator that makes semantic search possible. Without it, you're stuck with keyword matching.
Component 3: Vector Database
What it is: A specialized database designed specifically to store and search through vectors, the lists of numbers produced by embeddings.
The job it does: Stores embeddings and finds the most similar ones to a query vector, fast.
Why Not Just Use a Regular Database?
You might think: "I already have PostgreSQL/MySQL/MongoDB. Can't I just store vectors there?"
Technically, yes. Practically, no.
Here's the problem:
Traditional databases were built to answer questions like:
- "Find all customers WHERE age > 30"
- "Show me products WHERE price < 100"
- "Get employees WHERE name = 'John'"
But RAG needs to ask:
- "Find the 5 vectors most similar to this query vector"
That's a completely different operation.
The Mathematical Difference
Traditional databases use exact matching and sorting, with queries like SELECT * FROM products WHERE price < 100.
There's no IS_SIMILAR_TO operator in SQL. You'd have to manually compute similarity for EVERY row:
-- You'd have to do this (conceptually):
SELECT *,
CALCULATE_SIMILARITY(embedding, query_vector) AS score
FROM documents
ORDER BY score DESC
LIMIT 5;
The problem with this approach:
| Issue | Why It Hurts |
|---|---|
| Full table scan | Must check EVERY single row, every query |
| Expensive math | Computing similarity = hundreds of multiplications per row |
| No index helps | B-tree indexes work for sorting numbers, not finding neighbors in 768-dimensional space |
| Scale nightmare | 1 million documents × 768 numbers each = 💀 |
For a million documents, this would take MINUTES per query. Users would leave before getting an answer.
Enter Vector Databases
Vector databases were designed from the ground up to solve this exact problem.
What they provide:
- Specialized indexes for high-dimensional vectors (HNSW, IVF, PQ; we'll cover these in a later post)
- Approximate Nearest Neighbor (ANN) search. Trade tiny accuracy loss (99%→98%) for massive speed
- Built-in similarity functions. Cosine, Euclidean, Dot Product - ready to use
- Optimized for vector operations. Everything is designed around this one use case
Instead of complex SQL, you get simple APIs:
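For example, with Chroma (one of the databases in the table below), storing and searching chunks is a couple of calls. This is a minimal sketch; other vector databases have similarly small APIs, just with different method names.

import chromadb

collection = chromadb.Client().create_collection("documents")

# Store chunks: Chroma can embed them with its default model, or you can pass your own vectors.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=["Employees get 15 sick days per year.",
               "Parking permits are issued by Facilities."],
    metadatas=[{"source": "HR Policy Manual"}, {"source": "Office Guide"}],
)

# "Find the chunks most similar to this question" is one call.
results = collection.query(
    query_texts=["How many sick days do I get?"],
    n_results=2,                      # top-2 here; typically K=3-10
)
print(results["documents"][0])        # matching chunk texts, ranked by similarity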

That's it. No manual math, no optimization headaches. Just fast similarity search.
What Gets Stored
For each chunk of your documents, the vector database stores:
- The embedding vector (used for similarity search)
- The original chunk text (what eventually gets passed to the LLM)
- Metadata (source document, section, date, version)
Why store both vector AND text?
The vector is for searching (finding similar chunks); the text is what gets passed to the LLM (the actual content).
You search with vectors, but you answer with text.
Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed (Cloud) | Zero ops, quick start |
| Weaviate | Open Source / Managed | Hybrid search, multimodal |
| Qdrant | Open Source / Managed | High performance, rich filtering |
| Milvus | Open Source | Enterprise scale |
| Chroma | Open Source | Local dev, lightweight |
| pgvector | PostgreSQL Extension | Already using Postgres, want vector capabilities |
Different databases have different strengths, but they all solve the same core problem: storing and searching vectors efficiently.
Component 4: Retriever
What it is: The orchestrator that takes a user's question, converts it to a vector, searches the vector database, and returns the most relevant chunks.
The job it does: Bridges the gap between a human question and the relevant documents.
The Retriever's Workflow
The retriever is called every single time a user asks a question. Here's what it does:
- Takes the user's question
- Converts it to a vector using the same embedding model used for indexing
- Searches the vector database for the top-K most similar chunks
- Returns those chunks (with their text, metadata, and similarity scores)
Why it's called the "retriever": It retrieves the relevant information. That's literally its only job, but it's the most critical part of the pipeline.
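As a sketch, the whole retriever can be one function wrapped around the embedding model and vector database from the earlier examples (embed() and collection come from the indexing sketch; the score conversion assumes a cosine-style distance).

def retrieve(question: str, top_k: int = 5) -> list[dict]:
    # 1. Convert the question to a vector (same embedding model as indexing).
    query_vector = embed([question])[0]
    # 2. Ask the vector database for the top-K most similar chunks.
    results = collection.query(query_embeddings=[query_vector], n_results=top_k)
    # 3. Return text + source + a rough similarity score for each chunk.
    return [
        {"text": text, "source": meta["source"], "score": 1 - distance}
        for text, meta, distance in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],   # smaller distance = more similar
        )
    ]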
The Top-K Decision
K = the number of chunks to retrieve per query.
You might think: "Just grab the single best match, right?"
Not so fast. There are several reasons why you typically retrieve multiple chunks (K=3 to K=10):
Reason 1: Information Spread Across Chunks
The complete answer often lives in multiple chunks:
- Chunk A covers the duration (the 90 days)
- Chunk B covers the application process
- Chunk C covers return rights
If K=1, the user only learns about the 90 days. They miss the application process and return rights.
Reason 2: Highest Similarity ≠ Most Useful
Sometimes the highest-scoring chunk talks ABOUT the topic but doesn't answer the question:
- Chunk 1 (highest score): a general overview of the company's leave types
- Chunk 2 (slightly lower score): the step-by-step instructions for applying for leave
The first chunk scores higher (more similar to "leave") but doesn't answer "how to apply." The second chunk is what the user needs.
Reason 3: Ambiguous Queries
When queries are vague, multiple interpretations score high. A query like "leave policy" matches chunks about sick leave, vacation leave, and parental leave equally well, and the system can't tell which one the user means.
Retrieving multiple chunks (K=5-10) covers different interpretations.
The Trade-off
| K Value | Pros | Cons |
|---|---|---|
| Too few (K=1-2) | Fast, focused | Might miss info, incomplete answers |
| Sweet spot (K=3-10) | Balanced | Depends on chunk size and use case |
| Too many (K=20+) | Don't miss anything | Noise dilutes signal, slower, more expensive |
Most RAG systems use K=5 as a starting point, then tune based on results.
Semantic Search in Action
Here's the magic moment why embeddings make semantic search possible.
Three users ask about the same thing with different words:
- User A: "How many vacation days do I get?"
- User B: "What's the paid time off policy?"
- User C: "Tell me about annual leave."
With keyword search: Only User C gets results (exact match on "annual leave")
With semantic search: ALL THREE users find the document
Why? Because the embedding model understands:
"vacation days" ≈ "annual leave" ≈ "paid time off" ≈ "PTO"
All these questions get embedded to similar vectors, which match the document's vector.
The retriever doesn't care about exact words; it cares about meaning.
Component 5: Generator (LLM)
What it is: The Large Language Model (like GPT-4, Claude, or Gemini) that reads the retrieved chunks and writes the final answer.
The job it does: Synthesizes retrieved information into a coherent, natural language response.
The Open-Book Exam Revisited
Remember our analogy from the beginning? The Generator is where it all comes together.
The full picture: the retriever found the right pages of the "book" (your knowledge base), and now the LLM, like the student in the open-book exam, reads those pages and writes the answer in its own words.
What the LLM Receives
Here's what gets passed to the LLM:
- A system prompt with instructions on how to use the context
- The retrieved chunks, each labeled with its source
- The user's original question
The LLM's job: Read the context, understand what's relevant, and write a clear answer.
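Concretely, the assembled prompt is just text. Here's a minimal sketch of building it from the retrieved chunks (using the retrieve() helper sketched earlier); the exact wording of the template is up to you.

SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. If the context doesn't contain "
    "enough information, say so. Cite the document you're referencing."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    # Label every chunk with its source so the LLM can cite it.
    context = "\n\n".join(
        f"[Source: {chunk['source']}]\n{chunk['text']}" for chunk in chunks
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

question = "How many sick days do I get?"
prompt = build_prompt(question, retrieve(question))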
What Makes the Generator Different from a Regular LLM Call
A regular LLM call answers from whatever the model absorbed during training. The generator in a RAG system answers from the retrieved context: it's instructed to stay within those chunks, cite its sources, and say "I don't know" when the context doesn't cover the question.
The System Prompt Matters
The system prompt is your instruction manual for the LLM. It tells the LLM how to use the retrieved context.
Critical instructions to include:
- Use ONLY the provided context
- Say you don't have enough information when the context doesn't cover the question
- Cite the document or section you're referencing
Bad system prompt:
Answer the user's question helpfully.
This gives the LLM permission to use its training data, which defeats the purpose of RAG.
Good system prompt:
Answer using ONLY the provided context. If the context doesn't contain
enough information to answer confidently, say 'I don't have enough
information about that in the available documents.' Always cite which
document or section you're referencing.
This keeps the LLM grounded in your documents.
Why RAG Still Needs the LLM's Intelligence
You might wonder: "If we have the right chunks, why not just show them to the user?"
Because raw chunks are messy:
Retrieved Chunk 1: "Section 4.2.1: Sick Leave Entitlement Employees shall be entitled to fifteen (15) days of paid sick leave per annum, calculated on a calendar year basis (January 1 - December 31)."
Retrieved Chunk 2: "Section 4.2.3: Medical Certification Requirements In accordance with company policy, sick leave absences exceeding a period of three (3) consecutive business days shall require submission of valid medical certification from a licensed healthcare provider."
Retrieved Chunk 3: "Section 4.2.5: Leave Carry-Over Provisions Unused sick leave balances shall not be carried forward to subsequent calendar years and shall expire on December 31 of each year."
User asked: "How many sick days do I get?"
Just showing chunks: Legal jargon, scattered across three sections, hard to read.
LLM synthesis:
You get 15 sick days per year. Here are the key details:
- You're entitled to 15 days of paid sick leave annually (January 1 - December 31)
- If you're sick for more than 3 days in a row, you'll need a doctor's note
- Unused sick days don't roll over to the next year; they expire on December 31
Source: HR Policy Manual, Sections 4.2.1, 4.2.3, 4.2.5
What the LLM did:
- Extracted the key fact (15 days)
- Translated legal language to plain English
- Organized information logically
- Added helpful context (what happens if you're sick longer)
- Cited sources properly
This is why you need both: retrieval gets you the right information, and the LLM makes it readable and actionable.
The LLM's Semantic Understanding
There's one more subtle thing happening here that's easy to miss.
Remember the retriever found documents about "annual leave" when the user said "PTO"?
Well, the LLM continues that semantic understanding in the answer.

The LLM understands that:
- The retrieved chunk calls it "annual leave"
- The user calls it "PTO"
- They mean the same thing
- The user would prefer to see their own terminology in the response

This is why embeddings + LLMs together are so powerful. The entire pipeline has semantic understanding from start to finish.
The Complete Flow: Question to Answer
You've seen the individual components. Now let's watch them work together in a real scenario.
The Setup:
You've built a RAG system for your company's HR department. Your knowledge base contains:
- Employee handbook
- Benefits guide
- Leave policies
- Payroll procedures
- Compliance documents
Everything has been indexed (the preparation phase is done). Now an employee opens the chat and asks a question.
Let's walk through what happens, step by step.
The Scenario
Employee asks: "How many sick days do I get per year?"
Goal: Give them an accurate, cited answer in under 3 seconds.
Phase 1: What Already Happened (Indexing)
Before this question was ever asked, during the indexing phase, your system did this:
- Collected the HR documents (handbook, benefits guide, leave policies, payroll procedures, compliance docs)
- Split each document into chunks
- Converted every chunk into a vector with the embedding model
- Stored the vectors, text, and metadata in the vector database
Time taken: Hours (for thousands of documents)
When it runs: Once during setup, then whenever documents are updated
Now all your documents are "searchable." The vector database is ready.
Phase 2: What Happens Now (Query Time)
The employee hits enter. The clock starts ticking.
⏱️ T+0ms: Question Received
User Input: "How many sick days do I get per year?"
The RAG system receives the question. Time to find the answer.
⏱️ T+50ms: Step 1 - Embed the Query
Component in action: Embedding Model
The question "How many sick days do I get per year?" is converted into a query vector using the same embedding model that indexed the documents.
Why this is fast: Embedding models are optimized for speed. Converting one sentence to a vector takes milliseconds.
⏱️ T+100ms: Step 2 - Search the Vector Database
Component in action: Vector Database (via Retriever)

What just happened:
- Vector database compared the query vector against millions of stored vectors
- Found the 5 most similar chunks
- Returned them ranked by similarity score
Why this is fast: Specialized vector indexes make this search incredibly efficient, even with millions of vectors.
⏱️ T+150ms: Step 3 - Assemble the Context
Component in action: Retriever
The retriever takes those 5 chunks and formats them into a single prompt for the LLM.
What the retriever did:
- Took the raw search results
- Formatted them with source citations
- Added the system prompt
- Added the user's original question
- Created one complete prompt ready for the LLM
⏱️ T+150ms - T+2500ms: Step 4 - Generate the Answer
Component in action: Generator (LLM)
The LLM receives the assembled prompt and generates a response along these lines:
You get 15 sick days per year (January 1 - December 31). If you're out sick for more than 3 consecutive days, you'll need a doctor's note, and unused days don't carry over; they expire on December 31.
Source: HR Policy Manual, Sections 4.2.1, 4.2.3, 4.2.5
What the LLM did:
- Answered the direct question (15 days)
- Synthesized information from multiple chunks
- Organized it logically with bullet points
- Translated formal policy language to plain English
- Added relevant context the user might need
- Cited all sources
- Stayed grounded in the provided context (didn't hallucinate)
Typical LLM generation time: 1-2 seconds (depends on model and response length)
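As code, this step is a single chat-completion call built from the earlier sketches (retrieve, build_prompt, SYSTEM_PROMPT, and openai_client); the model name is illustrative, and streaming is enabled so the user sees words as they arrive.

question = "How many sick days do I get per year?"
chunks = retrieve(question)                            # steps 1-3: embed, search, assemble inputs

stream = openai_client.chat.completions.create(        # step 4: generation (the ~2-second part)
    model="gpt-4o",                                    # illustrative model name
    stream=True,                                       # send tokens as they are produced
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": build_prompt(question, chunks)},
    ],
)
for event in stream:
    # Each streamed event carries the next slice of the answer text.
    print(event.choices[0].delta.content or "", end="")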
⏱️ T+2500ms: Step 5 - Return to User
The answer, complete with citations, is streamed back to the user's chat window roughly 2.5 seconds after they hit enter.
The Full Timeline
Here's the complete journey and where the time goes:
- Embedding: 50ms (fast)
- Vector search: 50ms (fast, even with millions of vectors!)
- Context assembly: 50ms (just text formatting)
- LLM generation: 2000ms (slowest step, but streaming makes it feel faster)
- Network overhead: ~300ms (user's internet, server latency)
The bottleneck: LLM generation is the slowest step, but:
- Most modern LLMs support streaming (user sees words appearing in real-time)
- 2-3 seconds total is acceptable for most use cases
- The accuracy/citation trade-off is worth it
What Made This Work?
Let's trace back and see why this query succeeded:
Good Chunking (Happened During Indexing)
- The sick leave policy wasn't split mid-sentence
- Related information (15 days + medical certificate requirement) stayed together in one chunk
- Each chunk had enough context to stand alone
Quality Embeddings (Both Phases)
- "sick days" (user's words) matched "sick leave" (document's words)
- The embedding model understood semantic similarity
- Same model used for indexing and querying
Effective Retrieval (Query Time)
- Vector database found highly relevant chunks (score 0.94 = excellent match)
- Top-5 retrieval captured the main answer + important context
- Chunks from multiple documents provided comprehensive coverage
Grounded Generation (Query Time)
- LLM used ONLY the provided context (no hallucination)
- Synthesized information from multiple chunks into one coherent answer
- Translated policy-speak to plain English
- Cited sources properly
Clean Knowledge Base (Preparation)
- No duplicate or contradictory information
- Up-to-date policy (not mixing old and new versions)
- Good metadata for source attribution
Where All 5 Components Showed Up
Let's map the components to the flow:
| Component | When Used | What It Did |
|---|---|---|
| Knowledge Base | Indexing phase (hours ago) | Provided the source documents with sick leave policy |
| Embedding Model | Indexing (hours ago) + Query time (now) | Converted both documents and query to vectors |
| Vector Database | Indexing (hours ago) + Query time (now) | Stored vectors, then searched them to find top-5 matches |
| Retriever | Query time | Orchestrated: embed query → search → assemble context |
| Generator (LLM) | Query time | Read context, synthesized answer, cited sources |
Every single component played a critical role. Remove any one, and the system breaks.
What Can Go Wrong (And Why)
Now that you've seen the happy path, with everything working perfectly, let's talk about what can go wrong. Understanding failure modes is just as important as understanding how things work.
Failure Mode 1: Wrong Chunks Retrieved
Symptom: LLM gives a correct-sounding answer, but it's about the wrong topic.
Example:
A user asks "What's the parental leave policy?" The top-5 retrieved chunks are all about other kinds of leave (sick, annual, unpaid), the actual parental leave policy sits at rank 6 and never reaches the LLM, and the LLM confidently answers about the wrong leave type.
What went wrong:
- The query "parental leave" is semantically similar to all types of "leave"
- Vector search returned high-similarity chunks, but for the wrong leave type
- The actual parental leave policy got ranked 6th (below the top-5 cutoff)
Why it happened:
- Embedding model isn't specific enough (treats all "leave" similarly)
- OR: Chunks are too small (lost the "parental" context)
- OR: Top-K is too low (K=5 missed the relevant chunk at rank 6)
How to fix:
- Use hybrid search (keyword + semantic) to catch exact term "parental"
- Increase Top-K to 10 to cast a wider net
- Improve chunking to keep "parental leave" as a distinct semantic unit
- Add metadata filtering (category = "parental leave policies"); see the sketch below for how these fixes look in code
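Here's a hedged sketch of what the last few fixes can look like in code, using Chroma's where and where_document filters on the collection from the earlier sketches. It assumes the chunks were indexed with a category metadata field, and the $contains keyword check is a crude stand-in for real hybrid search.

def retrieve_parental(question: str, top_k: int = 10) -> list[str]:
    # Wider net (K=10) plus filters to pin down the right leave type.
    query_vector = embed([question])[0]
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=top_k,
        where={"category": "parental leave policies"},  # metadata filter (assumes a "category" field was indexed)
        where_document={"$contains": "parental"},       # crude keyword filter on the chunk text
    )
    return results["documents"][0]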
Failure Mode 2: Information Split Across Chunks
Symptom: LLM gives an incomplete answer.
Example:
A multi-step process is split across four chunks, but only two of them score high enough to make the top-5, so the LLM describes the first steps and stops there.
What went wrong:
- Complete answer required 4 chunks
- Only the first 2 were retrieved (Top-5 cutoff)
- Chunking split a multi-step process into separate chunks
- Those chunks didn't score high enough individually
How to fix:
- Increase Top-K (retrieve more chunks)
- Improve chunking strategy (keep related steps together)
- Use hierarchical retrieval (retrieve section headers, then sub-chunks)
Failure Mode 3: Contradictory Information
Symptom: LLM gets confused or gives hedged/uncertain answers.
Example:
The knowledge base still contains both the old refund policy ("30-day refunds") and the current one ("14-day refunds"). Both chunks get retrieved, and the LLM hedges, telling the user the window is either 14 or 30 days depending on which policy applies.
What went wrong:
- Knowledge base contained both old and new policy versions
- Both ranked highly (similar semantic meaning)
- LLM received contradictory information
How to fix:
- Clean knowledge base (remove outdated documents)
- Add metadata filtering (version = "current", year = 2024)
- Use document versioning in metadata
- Implement "last-updated" logic in retrieval
Failure Mode 4: Over-Generic Queries
Symptom: Too many irrelevant chunks retrieved.
Example:
A user asks "Tell me about leave." Chunks about sick leave, vacation leave, parental leave, and unpaid leave all score highly, and the answer turns into a vague summary of everything rather than what the user actually wanted.
What went wrong:
- Query was too vague
- Everything about "leave" scored high
- User probably meant a specific leave type but didn't say which
How to fix:
- Query clarification (ask user: "Which type of leave?")
- Query expansion (LLM suggests: "Did you mean sick leave, vacation leave, or parental leave?")
- Multi-turn conversation (follow-up question to narrow scope)
Failure Mode 5: Hallucination Despite Context
Symptom: LLM adds information not in the retrieved chunks.
Example:
The retrieved chunk only says employees "may take leave in case of illness." The LLM fills the gap with a specific number of sick days that appears nowhere in the context, pulled from its training data.
What went wrong:
- The retrieved chunk was vague ("illness")
- LLM filled in gaps with its training data
- System prompt wasn't strict enough
How to fix:
- Stricter system prompt: "Use ONLY the provided context. Do not add information from your training data."
- Shorter context window (less room for the LLM to "think" beyond context)
- Post-processing check (verify each sentence has a source chunk; a naive version is sketched below)
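The post-processing check can start out very simple. Here's a naive sketch that flags any answer sentence sharing too few words with the retrieved chunks; production groundedness checks are more sophisticated, but this illustrates the idea.

import re

def flag_unsupported_sentences(answer: str, chunks: list[str], min_overlap: int = 3) -> list[str]:
    # Flag any sentence that shares fewer than min_overlap words with every chunk.
    chunk_words = [set(re.findall(r"\w+", chunk.lower())) for chunk in chunks]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        words = set(re.findall(r"\w+", sentence.lower()))
        if words and not any(len(words & cw) >= min_overlap for cw in chunk_words):
            flagged.append(sentence)   # likely not grounded in the retrieved context
    return flagged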
Failure Mode 6: Slow Retrieval
Symptom: System is accurate but too slow (>5 seconds).
What went wrong:
- Vector database not optimized
- Too many chunks to embed/search
- Inefficient indexing algorithm
How to fix:
- Use approximate nearest neighbor (ANN) search instead of exact
- Optimize vector database indexes
- Cache frequently-asked queries (see the sketch below)
- Use faster embedding model (smaller dimensions)
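Caching is often the cheapest win. As a sketch, memoizing the query-embedding step with functools.lru_cache means repeated questions skip the embedding call entirely; if your traffic is very repetitive, you can cache whole answers the same way.

from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query_cached(question: str) -> tuple[float, ...]:
    # Repeated questions skip the embedding call; a tuple is returned because lists aren't hashable.
    return tuple(embed([question])[0])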
The Common Thread
Notice a pattern in these failures?
Most failures happen at the retrieval stage, not the LLM stage.

The insight: If you retrieve the RIGHT chunks, even a mediocre LLM will give good answers. But if you retrieve WRONG chunks, even the best LLM can't save you.
This is why the middle of this series (Posts 3-5) focuses heavily on retrieval: chunking, embeddings, and advanced retrieval techniques. Get retrieval right, and most of your problems disappear.
Key Takeaways
You've just built your mental model of RAG architecture. Let's crystallize the key concepts.
The Mental Model
Indexing phase (offline): Knowledge Base → chunks → Embedding Model → Vector Database.
Query phase (online): question → Embedding Model → Retriever → Vector Database → top-K chunks → Generator (LLM) → cited answer.
The Core Principles
1. Two Phases, Two Speeds
- Indexing is slow but rare (only when documents change)
- Querying is fast and constant (every user question)
- This separation is what makes RAG scalable
2. Embeddings Are the Bridge
- They translate human language to mathematical space
- Similar meanings = similar vectors
- This enables semantic search (not just keyword matching)
3. Vector Databases Are Purpose-Built
- Traditional databases can't efficiently search high-dimensional vectors
- Vector databases use specialized indexes for fast similarity search
- They make searching millions of vectors possible in milliseconds
4. Retrieval Is the Bottleneck
- Most RAG failures happen at the retrieval stage, not generation
- Getting the RIGHT chunks matters more than having the BEST LLM
- This is why Posts 3-5 focus heavily on retrieval techniques
5. The LLM Synthesizes, Not Memorizes
- The LLM doesn't "know" your data (it's not in the model weights)
- It reads retrieved chunks and writes an answer
- This is why RAG can stay current without retraining
The Questions You Can Now Answer
After reading this post, you should be able to explain:
What's the difference between indexing and query time?
- Indexing = preparation (slow, rare)
- Query time = answering questions (fast, constant)
Why do we need embeddings?
- To enable semantic search
- So "vacation days" can match "annual leave"
Why can't we just use a regular database?
- Traditional databases weren't built for high-dimensional similarity search
- Vector databases are optimized specifically for this
What does Top-K mean?
- The number of chunks to retrieve per query
- Typically K=3 to K=10 for most systems
What's the slowest part of a RAG query?
- LLM generation (1-2 seconds)
- But streaming makes it feel faster
Where do most RAG systems fail?
- Retrieval (wrong chunks, incomplete info, contradictions)
- Get retrieval right, and most problems disappear
What You Don't Need to Worry About Yet
We intentionally kept some details out of this post. You don't need to understand these yet:
- How to choose an embedding model (Post 4)
- What chunking strategy to use (Post 3)
- How HNSW or IVF indexes work (Post 4)
- Whether to use hybrid search or pure vector search (Post 5)
- How to evaluate your RAG system (Post 8)
For now, you have the foundation. Everything else builds on this mental model.
What's Next
You understand the architecture. You know the five components. You've seen the complete flow from question to answer.
But here's the thing: Building a RAG system is one thing. Building a GOOD RAG system is entirely different.
The difference between a RAG system that works 60% of the time and one that works 95% of the time? It's not the LLM. It's the decisions you make about chunking, embeddings, and retrieval.
The Journey Ahead
Coming up in this series: chunking strategies (Post 3), embedding models and vector indexes (Post 4), retrieval techniques like hybrid search (Post 5), and evaluating your RAG system (Post 8).
Post 3 Preview: Chunking Strategies
In the next post, we're diving into the decision that determines what your RAG system can find: how you chunk your documents.
Here's what we'll cover:
The Chunking Problem:
- Remember that sick leave example? The answer worked because the chunking kept "15 days" + "medical certificate requirement" in the same chunk
- What if the chunking had split mid-sentence? The answer would have been incomplete
- Chunking is the invisible decision that makes or breaks retrieval
What You'll Learn:
Fixed-Size vs. Semantic Chunking
- Should you split every 512 tokens? Or based on meaning?
- What's the trade-off between simplicity and accuracy?
Handling Structure
- What do you do with tables? Bullet points? Nested lists?
- How do you preserve context in hierarchical documents?
- When does structure matter, and when can you ignore it?
The Overlap Strategy
- Should chunks overlap? By how much?
- What's the trade-off between redundancy and lost context?
Chunk Size Optimization
- Too small = lost context, too many irrelevant retrievals
- Too large = irrelevant information dilutes the signal
- How do you find the sweet spot for YOUR documents?
Real-World Examples
- See side-by-side comparisons: good chunking vs. bad chunking
- Understand why the same query retrieves different results based on chunking strategy
- Learn the chunking decisions that took systems from 60% → 95% accuracy
Why Chunking Matters More Than You Think
Here's the brutal truth about RAG:
If you chunk poorly, no amount of fancy retrieval techniques will save you. You could use the best embedding model, the fastest vector database, and the most advanced reranking, but if the right information isn't in a retrievable chunk, your system will fail.
Chunking is the foundation. Get it right, and everything else gets easier. Get it wrong, and you'll spend months debugging retrieval problems.
In Post 3, you'll learn:
- How to analyze your documents and pick the right chunking strategy
- The questions to ask before you start chunking
- Common chunking mistakes and how to avoid them
- When to use advanced techniques vs. simple approaches
See You in Post 3
You've got the architecture. Now let's master the first real implementation decision: how to prepare your documents so RAG can actually find what it needs.
If you missed Part 1 of this series, you can read it here.
Ready to Build Your RAG System?
We help companies build production-grade RAG systems that actually deliver results. Whether you're starting from scratch or optimizing an existing implementation, we bring the expertise to get you from concept to deployment. Contact us and let's talk about your use case.

Part 2 of the RAG Deep Dive Series | Next up: Chunking Strategies