How to Build a RAG System (Retrieval-Augmented Generation)
A practical engineering guide to building Retrieval-Augmented Generation: ingestion and chunking, embeddings, vector databases, retrieval and reranking, grounded prompts with citations, generation, and the evals that keep answers faithful.
Key Takeaways
- A RAG system retrieves the most relevant chunks of your own data and feeds them to an LLM at query time, so answers are grounded in your sources instead of the model's memory.
- The pipeline is a sequence — ingest and clean, chunk, embed, store vectors, retrieve, rerank, assemble the prompt, generate, and evaluate — and the weakest stage caps the quality of the whole system.
- Chunking strategy matters more than most teams expect: chunks that ignore document structure are the single most common cause of bad retrieval, and there is no universal "right" size.
- Retrieval quality, not the language model, is usually the bottleneck; hybrid search plus a reranking step recovers far more good answers than swapping in a bigger model.
- Grounded prompting with explicit citations and an instruction to answer only from the supplied context is what turns retrieved chunks into trustworthy, checkable answers.
- You cannot improve what you do not measure: evaluate retrieval (did the right chunks come back?) and faithfulness (did the answer stay true to them?) separately, because they fail for different reasons.
A RAG system retrieves the most relevant chunks of your own data and feeds them to an LLM at query time, so the answer is grounded in your sources instead of the model's memory. You build one as a pipeline: ingest and clean your documents, split them into chunks, turn each chunk into an embedding, store those vectors, then for every question retrieve the best matches, rerank them, paste them into the prompt, and generate a cited answer. The hard part is not any single component — it is that the whole system is only as good as its weakest stage, and the weakest stage is almost always retrieval.
This guide walks the pipeline in order, defines each term as it comes up, and is blunt about where real RAG systems break and how to fix them. It is written for engineers and technical product leads who want a working mental model before they pick tools. If you are still deciding whether retrieval is even the right approach for your problem, start with RAG vs fine-tuning and come back once you know you need grounding in your own data.
What is RAG?
RAG — retrieval-augmented generation — is a technique that fetches relevant information from your own data at query time and supplies it to a language model as context, so the model answers from that information rather than from whatever it absorbed during training. In one sentence: RAG gives a model knowledge it never saw at training time, on demand, for each question, and lets that knowledge be cited.
The reason RAG exists is that language models have two well-known limits. They do not know anything that happened after their training cutoff, and they do not know your private documents at all — your contracts, your codebase, your support history, your internal policies. Worse, when a model lacks a fact it tends to invent a plausible one. RAG addresses both by retrieving real passages and instructing the model to answer from them, which is why it is the default architecture for question-answering over a company's own knowledge.
A few terms recur throughout this guide, so it is worth fixing them now. An embedding is a list of numbers that represents the meaning of a piece of text; texts with similar meaning produce vectors that sit close together. A chunk is a passage of a document small enough to retrieve precisely. Retrieval is the search step that finds the chunks most relevant to a question. Reranking is a second, sharper scoring pass over those results. And grounding is the practice of forcing the answer to rely on retrieved sources you can point to.
How does a RAG system work?
A RAG system works in two phases: an offline phase that prepares your data, and an online phase that runs for every question. Most teams blur them together and then cannot tell which one is failing, so keep them separate in your head.
The offline (indexing) phase happens ahead of time and repeats whenever your data changes: you ingest and clean documents, chunk them, embed each chunk, and store the vectors. The online (query) phase happens live: you embed the user's question, retrieve candidate chunks, rerank them, assemble a prompt, and generate an answer. Drawn as a list, the full pipeline is:
- Ingest and clean your sources.
- Chunk documents sensibly.
- Generate embeddings for each chunk.
- Store the vectors in a vector database.
- Retrieve the top matches for a query.
- Rerank the candidates for true relevance.
- Assemble a grounded prompt and generate the answer.
- Evaluate retrieval and answer quality, then iterate.
Because each stage feeds the next, errors compound. A bad chunk produces a misleading embedding, which surfaces in retrieval, which poisons the prompt, which yields a confident wrong answer. That is why the rest of this guide treats the stages individually — and why your debugging should too. When an answer is wrong, find the earliest stage that went wrong rather than blaming the model at the end.
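One way to keep the two phases separate in code as well as in your head is to give each its own function and have the online phase return its intermediate results. The sketch below is only an illustration: `build_index`, `answer`, and every `toy_*` stage are hypothetical names, and the toy stages are trivial stand-ins (a word set instead of a vector, a no-op reranker) so the example runs without any external service.

```python
def build_index(documents, chunk, embed):
    """Offline phase: chunk every document, then embed every chunk."""
    return [(embed(piece), piece) for doc in documents for piece in chunk(doc)]


def answer(question, index, embed, retrieve, rerank, generate, top_n=3):
    """Online phase. Returns every intermediate result so a wrong answer
    can be traced to the earliest stage that produced bad output."""
    query_vec = embed(question)
    candidates = retrieve(query_vec, index)
    context = rerank(question, candidates)[:top_n]
    return {
        "candidates": candidates,
        "context": context,
        "answer": generate(question, context),
    }


# Deliberately trivial stand-ins so the sketch runs; none is a real stage.
def toy_chunk(doc):
    return [s.strip(" .") for s in doc.split(". ") if s.strip(" .")]


def toy_embed(text):
    return set(text.lower().split())  # a word set standing in for a vector


def toy_retrieve(query_vec, index):
    ranked = sorted(index, key=lambda item: len(query_vec & item[0]), reverse=True)
    return [text for _, text in ranked]


def toy_rerank(question, candidates):
    return candidates  # no-op placeholder for a real reranker


def toy_generate(question, context):
    return f"Based on: {context[0]}"  # placeholder for an LLM call


documents = ["Refunds take five days. Shipping is free.", "Support is open on weekdays."]
index = build_index(documents, toy_chunk, toy_embed)
trace = answer("how long do refunds take", index,
               toy_embed, toy_retrieve, toy_rerank, toy_generate, top_n=2)
print(trace["answer"])  # Based on: Refunds take five days
```

When an answer is wrong, inspecting `trace["candidates"]` and `trace["context"]` before `trace["answer"]` follows the same front-to-back order the guide recommends.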
How do you ingest and clean your sources?
Ingestion is the unglamorous work of turning messy real-world documents into clean, structured text with metadata attached — and it sets the ceiling for everything downstream. Garbage in, grounded garbage out.
Pull text out of whatever formats you actually have: HTML, PDFs, Office files, wikis, ticket exports, transcripts. Strip the noise that adds no meaning — navigation bars, cookie banners, repeated headers and footers, boilerplate legal footers — because that text will otherwise get chunked, embedded, and retrieved as if it mattered. PDFs deserve special caution: naive extraction often scrambles multi-column layouts and tables, so verify the extracted text reads correctly before you trust it.
Just as important, attach metadata to every document and carry it through the pipeline: title, source URL or file path, author, last-updated date, and any access-control tags. You will need this to cite sources, to filter retrieval by recency or permission, and to drop stale content. Metadata captured at ingestion is nearly free; reconstructing it later is painful.
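As a rough illustration of cleaning plus metadata capture, the sketch below strips noise from an HTML page using only the Python standard library and returns the text together with its metadata. The `Document` fields, the `ingest_html` helper, and the `NOISE_TAGS` set are assumptions made for this example; real pages usually need a more forgiving extractor, and PDFs need a different one entirely.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags whose contents count as noise in this sketch (an assumption;
# adjust the set to match your own pages).
NOISE_TAGS = {"nav", "header", "footer", "script", "style", "aside"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping anything nested inside a noise tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


@dataclass
class Document:
    text: str
    title: str
    source: str
    updated: str  # ISO date string, e.g. "2024-01-15"
    access_tags: list[str] = field(default_factory=list)


def ingest_html(html: str, *, title: str, source: str, updated: str,
                access_tags: list[str] | None = None) -> Document:
    """Extract clean text and attach the metadata in one step."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    return Document(text, title, source, updated, access_tags or [])


page = ("<html><nav>Home | Docs</nav><h1>Refund policy</h1>"
        "<p>Refunds take 5 days.</p><footer>Copyright Example</footer></html>")
doc = ingest_html(page, title="Refund policy",
                  source="https://example.com/refunds",
                  updated="2024-01-15", access_tags=["public"])
print(doc.text)  # Refund policy Refunds take 5 days.
```

Carrying `source`, `updated`, and `access_tags` on the same object as the text means every later stage can cite, filter, and expire content without a second lookup.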
How do you chunk documents?
You chunk documents by splitting them into passages that are small enough to retrieve precisely but large enough to remain self-contained — and you split along the document's natural structure, not at an arbitrary character count. Bad chunking is the single most common cause of bad RAG, full stop.
The failure mode is easy to picture. If you cut every 1,000 characters with no regard for structure, you will routinely slice a sentence — or a table, or a code block — straight down the middle. Half the answer lands in one chunk and half in another, the embeddings of both are muddy, and retrieval brings back a fragment that does not actually contain the answer. The model then has to guess, and guessing is exactly what you adopted RAG to prevent.
Sensible chunking comes down to a few principles:
- Respect structure. Split on headings, sections, and paragraphs first; only fall back to size limits within those boundaries. Markdown headings, HTML tags, and code-block delimiters are free signals about where ideas begin and end.
- Right-size for your content. A few hundred tokens per chunk is a common starting point, but dense technical material often wants smaller chunks while narrative text tolerates larger ones. Treat size as a parameter you tune, not a constant you copy from a tutorial.
- Add a little overlap. Letting neighboring chunks share a sentence or two prevents an answer from being severed exactly at a boundary, at the cost of some duplication.
- Keep tables and code intact. These break worst under naive splitting; isolate them and keep each one whole.
- Prepend context to each chunk. Storing the document title and section heading alongside the chunk text gives a lonely paragraph the context it needs to be understood and retrieved well.
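The principles above can be sketched for Markdown-style text as follows. This is a minimal illustration under several assumptions: word count stands in for token count, `chunk_markdown` and its parameters are hypothetical names, and the sketch does not detect tables or fenced code blocks, which a production chunker would need to keep whole.

```python
from __future__ import annotations

import re


def _render(title: str, heading: str, paragraphs: list[str]) -> str:
    """Prepend the document title and section heading to the chunk text."""
    label = f"{title} > {heading}" if heading else title
    return label + "\n" + "\n\n".join(paragraphs)


def chunk_markdown(text: str, title: str, max_words: int = 120,
                   overlap_paragraphs: int = 1) -> list[str]:
    """Split on headings first, then pack paragraphs up to max_words."""
    chunks: list[str] = []
    # Zero-width split: each section starts at a line beginning with '#'.
    for section in re.split(r"(?m)^(?=#{1,6} )", text):
        section = section.strip()
        if not section:
            continue
        first, _, rest = section.partition("\n")
        if first.startswith("#"):
            heading, body = first.lstrip("#").strip(), rest
        else:
            heading, body = "", section
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
        current: list[str] = []
        for para in paragraphs:
            size = sum(len(p.split()) for p in current) + len(para.split())
            if current and size > max_words:
                chunks.append(_render(title, heading, current))
                # Carry the last paragraph(s) forward as overlap.
                current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current.append(para)
        if current:
            chunks.append(_render(title, heading, current))
    return chunks


sample = """# Billing
Invoices are issued monthly.

Payment is due in 30 days.

# Cancellation
You can cancel at any time from the account page."""

chunks = chunk_markdown(sample, title="Account guide", max_words=6)
for c in chunks:
    print(repr(c))
```

With the tiny `max_words=6` used here, the Billing section yields two chunks that share the first paragraph as overlap, and the Cancellation section stays in its own chunk because the splitter never crosses a heading.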
What are embeddings and how do you generate them?
An embedding is a numeric vector that captures the meaning of a chunk, and you generate one by running each chunk through an embedding model. Because similar meanings yield nearby vectors, embeddings are what let you search by semantic similarity — finding a passage about "canceling a subscription" even when the user typed "how do I stop being billed."
Two rules prevent most embedding mistakes. First, use the same embedding model for your chunks and your queries; if they differ, the vectors live in incompatible spaces and similarity becomes meaningless. Second, treat the embedding model as a versioned dependency: record which model and version produced your index, because upgrading the model means re-embedding the entire corpus. Beyond that, batch the embedding calls for throughput, and consider that domain-heavy jargon sometimes embeds better with a model tuned for that domain — a trade-off worth testing against your own evals rather than assuming. The embedding model is a meaningful choice, and the same disciplined approach you would use to choose the right LLM applies to picking it.
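The two rules can be made concrete with a small sketch. The `embed` function here is a toy stand-in (a hashed bag of words) chosen only so the example runs offline; it does not capture meaning the way a real embedding model does. `EMBEDDING_MODEL`, `build_index`, and `embed_query` are hypothetical names. What the sketch is meant to show is the shape: batch the chunks, store the model identifier with the index, and refuse to embed a query against an index built by a different model.

```python
from __future__ import annotations

import hashlib
import math

EMBEDDING_MODEL = "toy-hash-v1"  # hypothetical identifier, stored with the index
DIM = 64


def embed(text: str) -> list[float]:
    """Toy stand-in for a real model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def embed_batch(texts: list[str]) -> list[list[float]]:
    """Embed many chunks in one call; a real client would send one request."""
    return [embed(t) for t in texts]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product equals cosine similarity because vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def build_index(chunks: list[str]) -> dict:
    """Record which model produced the vectors alongside the vectors."""
    return {"model": EMBEDDING_MODEL, "dim": DIM,
            "chunks": chunks, "vectors": embed_batch(chunks)}


def embed_query(query: str, index: dict) -> list[float]:
    """Rule one: queries must use the same model as the indexed chunks."""
    if index["model"] != EMBEDDING_MODEL:
        raise ValueError(
            f"index built with {index['model']!r}, queries use "
            f"{EMBEDDING_MODEL!r}: re-embed the corpus first")
    return embed(query)


index = build_index(["cancel your subscription", "reset your password"])
query_vec = embed_query("cancel subscription", index)
scores = [cosine(query_vec, v) for v in index["vectors"]]
```

Storing the model name in the index turns a silent quality drop after an upgrade into a loud error at query time.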
Which vector database should you use?
You should use the vector store that matches your scale, freshness needs, and existing infrastructure — not the one with the loudest marketing. A vector database is specialized storage that indexes embeddings for fast similarity search, but it is a means to good retrieval, not the point of RAG.
The realistic options fall into a few buckets, and the right pick is mostly about operational fit:
- A vector extension on a database you already run. If your data already lives in a relational database that offers vector search, this is often the lowest-friction start — one system to operate, and you can mix vector search with ordinary filters and joins.
- A purpose-built vector database. Dedicated stores are built for large-scale similarity search with rich metadata filtering and tuning knobs. Reach for one when corpus size, query volume, or filtering needs outgrow a general-purpose database.
- A managed vector service. Hosted offerings remove the operational burden of running the index yourself, in exchange for cost and less control. Sensible when you want to move fast and not babysit infrastructure.
- An in-memory or on-disk library. For a small or static corpus, a lightweight local index can be entirely sufficient and is the simplest thing that works.
Whatever you choose, insist on solid metadata filtering — the ability to scope a search by source, date, or permission. In any real product, filtering retrieval to what the current user is allowed to see is not optional, and bolting access control on afterward is far harder than choosing a store that supports it natively.
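To show what metadata filtering means in practice, here is a minimal in-memory store of the "simplest thing that works" kind, assuming a brute-force cosine scan and a caller-supplied predicate over metadata. `InMemoryVectorStore`, `Record`, and the `where` parameter are names invented for this sketch and do not correspond to any particular product's API. The vectors are hand-written two-dimensional values so the behavior is easy to follow.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    vector: list[float]
    text: str
    metadata: dict


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    def __init__(self) -> None:
        self._records: list[Record] = []

    def add(self, vector: list[float], text: str, metadata: dict) -> None:
        self._records.append(Record(vector, text, metadata))

    def search(self, query: list[float], k: int = 3,
               where: Optional[Callable[[dict], bool]] = None) -> list[Record]:
        """Filter on metadata first, then rank the survivors by similarity."""
        allowed = [r for r in self._records if where is None or where(r.metadata)]
        allowed.sort(key=lambda r: _cosine(query, r.vector), reverse=True)
        return allowed[:k]


store = InMemoryVectorStore()
store.add([1.0, 0.0], "Holiday calendar", {"access": "public"})
store.add([0.9, 0.1], "Salary bands", {"access": "hr-only"})
store.add([0.0, 1.0], "Parking rules", {"access": "public"})

query = [1.0, 0.1]
unfiltered = store.search(query, k=1)
visible = store.search(query, k=2, where=lambda m: m["access"] == "public")
print([r.text for r in unfiltered])  # ['Salary bands']
print([r.text for r in visible])     # ['Holiday calendar', 'Parking rules']
```

The restricted record is the closest match to the query, yet it never reaches a user limited to public content because the filter runs before scoring rather than after.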
How do you retrieve and rerank results?
You retrieve by embedding the question and searching the store for the nearest chunks, then you rerank those candidates with a sharper model that reorders them by true relevance to the query. Retrieval is where most RAG quality is won or lost — not in the language model. A bigger generator cannot answer from a passage it was never given.
Two upgrades to naive vector search pay for themselves quickly:
- Hybrid search. Pure semantic search can miss exact terms (product names, error codes, acronyms, part numbers) because an embedding captures what a string means, not its literal characters, and identifiers like these carry little meaning to capture. Combining vector search with traditional keyword search catches both the meaning and the literal string, and the merged result set is consistently stronger than either alone.
- Reranking. The first search is fast but approximate, so run it loose and over-fetch a wide candidate set. Then pass those candidates through a reranker — a model that reads each one against the actual query and scores its real relevance — and keep only the top few. Reranking is one of the highest-leverage additions in all of RAG.
Two practical guardrails: filter by metadata during retrieval so you never even consider documents the user should not see, and resist the urge to feed the model everything you found. Stuffing twenty marginal chunks into the prompt buries the answer in noise, slows the response, and runs up the bill. A few highly relevant chunks beat a pile of mediocre ones — which is also one of the simplest ways to reduce LLM API costs without hurting quality.
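The guide does not prescribe how to merge vector and keyword results, so the sketch below uses reciprocal rank fusion, one common choice, purely as an illustration. The reranker is represented by a pluggable `score_fn`; the `word_overlap` scorer is a toy placeholder for a real reranking model, and every chunk id and text is invented for the example.

```python
from __future__ import annotations

from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str, candidate_ids: list[str], texts: dict[str, str],
           score_fn: Callable[[str, str], float], top_n: int = 2) -> list[str]:
    """Score each candidate against the query and keep only the top few."""
    ranked = sorted(candidate_ids,
                    key=lambda d: score_fn(query, texts[d]), reverse=True)
    return ranked[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Toy placeholder for a reranking model: count of shared words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


texts = {
    "c1": "Reset your password from the login page",
    "c2": "Error E42 means the payment card was declined",
    "c3": "Invoices are issued monthly",
    "c4": "Contact support about error codes",
}
vector_hits = ["c1", "c2", "c3"]   # pretend output of semantic search
keyword_hits = ["c4", "c2", "c1"]  # pretend output of keyword search

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
top = rerank("what does error E42 mean", fused, texts, word_overlap, top_n=2)
print(fused)  # ['c1', 'c2', 'c4', 'c3']
print(top)    # ['c2', 'c4']
```

Fusion over-fetches a wide candidate set from both searches, and the rerank step then trims it to the few chunks that actually go into the prompt.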
How do you assemble the prompt and generate the answer?
You assemble the prompt by combining the reranked chunks with the user's question and an instruction set that forces the model to stay grounded, then you call the LLM to generate the answer. This is the step where retrieved text becomes a trustworthy, checkable response — or where a good retrieval gets squandered by a sloppy prompt.
A grounded prompt should do four things explicitly:
- Provide the context clearly. Present the retrieved chunks as labeled sources, each carrying an identifier and its metadata, so the model can attribute claims to specific passages.
- Constrain the model to the context. Instruct it to answer using only the supplied sources and not its own background knowledge. This is the core grounding instruction, and it is what keeps RAG honest.
- Require citations. Ask the model to indicate which source supports each claim, so answers can be verified against the originals rather than taken on faith.
- Allow "I do not know." Tell it to say the context does not contain the answer when that is true. Without this, a model will fill the gap with a confident invention, which is the failure mode you adopted RAG to avoid.
A compact prompt skeleton makes the shape concrete:
SYSTEM:
Answer the question using only the sources below.
Cite the source id in brackets after each claim.
If the sources do not contain the answer, say you don't know.
SOURCES:
[1] (title, date) ...chunk text...
[2] (title, date) ...chunk text...
QUESTION:
{user question}
Then generation itself is a normal LLM call, but the retrieval context is what makes it grounded. If you find yourself reaching for a far larger, costlier model to fix wrong answers, pause: the fix is usually better chunks in the prompt, not more parameters behind it.
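The prompt skeleton can be assembled programmatically along these lines. The sketch assumes a hypothetical `Source` record and `build_prompt` helper, and it adds a small `invalid_citations` check, which is an extra not described in the guide: it flags bracketed ids in an answer that do not correspond to any supplied source. The LLM call itself is left out because it depends on the provider you use.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

RULES = [
    "Answer the question using only the sources below.",
    "Cite the source id in brackets after each claim.",
    "If the sources do not contain the answer, say you don't know.",
]


@dataclass
class Source:
    title: str
    date: str
    text: str


def build_prompt(question: str, sources: list[Source]) -> str:
    """Render the grounded prompt: rules, numbered sources, then the question."""
    lines = ["SYSTEM:", *RULES, "SOURCES:"]
    for i, src in enumerate(sources, start=1):
        lines.append(f"[{i}] ({src.title}, {src.date}) {src.text}")
    lines += ["QUESTION:", question]
    return "\n".join(lines)


def invalid_citations(answer: str, n_sources: int) -> list[int]:
    """Return cited ids that do not match any supplied source."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(i for i in cited if not 1 <= i <= n_sources)


sources = [
    Source("Refund policy", "2024-01-15", "Refunds take 5 days."),
    Source("Shipping FAQ", "2023-11-02", "Standard shipping is free."),
]
prompt = build_prompt("How long do refunds take?", sources)
print(prompt)

# Checking a model's answer after generation (the answer here is made up):
print(invalid_citations("Refunds take 5 days [1]. They are free [3].", len(sources)))  # [3]
```

A citation to a source id that was never supplied is a cheap, mechanical signal that the answer drifted away from the provided context.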
How do you evaluate a RAG system?
You evaluate a RAG system by measuring retrieval and generation separately, because they fail for different reasons and a single end-to-end score hides which one broke. The two questions are: did the right information come back, and did the answer stay faithful to it?
Start by building a small evaluation set — a collection of real questions paired with the passages that should answer them and a known-good response. A few dozen carefully chosen cases that cover your common and edge scenarios are worth far more than a vague sense that "it seems to work." Then measure two things:
- Retrieval quality. For each question, check whether the correct passage appears in the top results. Track how often it does, and how highly it ranks. If retrieval is missing the right passages, no amount of prompt or model tuning will save the answer — the problem is upstream in chunking, embeddings, or search.
- Answer faithfulness. Check whether every claim in the answer is actually supported by the retrieved context, rather than invented. Faithfulness is usually scored with a blend of human review and LLM-as-judge grading, and it is the metric that tells you whether your grounding is holding. Track answer relevance and latency alongside it.
Treat these evals like a regression suite: run them automatically on every change to chunking, embeddings, retrieval, reranking, or prompts, so you can tell whether a tweak actually helped or just felt better. And close the loop — when the live system gives a wrong answer, add that case to the evaluation set so the same failure cannot silently return. The discipline here is the same one that governs evaluating and testing AI agents in general: you cannot improve what you do not measure.
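The retrieval half of such a regression suite can be quite small. The sketch below computes two standard retrieval metrics, hit rate at k ("how often it does") and mean reciprocal rank ("how highly it ranks"), over a hand-made evaluation set. The case structure, the chunk ids, and the function names are assumptions for illustration; faithfulness scoring is not shown because it needs human review or a judge model.

```python
from __future__ import annotations


def hit_rate_at_k(cases: list[dict], k: int) -> float:
    """Share of questions whose expected chunk appears in the top k results."""
    hits = sum(1 for c in cases if c["expected"] in c["retrieved"][:k])
    return hits / len(cases)


def mean_reciprocal_rank(cases: list[dict]) -> float:
    """Average of 1 / rank of the expected chunk; a miss contributes 0."""
    total = 0.0
    for c in cases:
        if c["expected"] in c["retrieved"]:
            total += 1.0 / (c["retrieved"].index(c["expected"]) + 1)
    return total / len(cases)


# Each case pairs a question with the chunk id that should answer it and
# the ids the retriever actually returned, best first (all values invented).
cases = [
    {"question": "How long do refunds take?",
     "expected": "c2", "retrieved": ["c2", "c5", "c1"]},
    {"question": "What does error E42 mean?",
     "expected": "c7", "retrieved": ["c3", "c7", "c9"]},
    {"question": "Is parking free on weekends?",
     "expected": "c4", "retrieved": ["c1", "c2", "c3"]},
]

print(round(hit_rate_at_k(cases, 1), 3))      # 0.333
print(round(hit_rate_at_k(cases, 3), 3))      # 0.667
print(round(mean_reciprocal_rank(cases), 3))  # 0.5
```

Running these numbers before and after a change to chunking or search shows whether retrieval actually improved, and the third case (a miss at every k) is the kind of failure worth adding to the set whenever it shows up in production.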
What are the common RAG failure modes — and how do you fix them?
Most RAG systems fail in a handful of recognizable ways, and nearly all of them trace back to retrieval rather than the language model. Naming the failure precisely is what points you at the fix.
- Bad chunking. Symptom: retrieved passages are fragments that do not contain the full answer. Fix: chunk along structure, size chunks for your content, add overlap, and keep tables and code intact.
- Weak retrieval. Symptom: the right passage exists in your corpus but never comes back for the query. Fix: add hybrid search so exact terms are not lost, and add a reranking pass to push true matches to the top.
- Stuffed context. Symptom: the answer is buried, vague, or slow because too many marginal chunks were crammed into the prompt. Fix: rerank hard and pass only the top few chunks, not everything retrieved.
- Ungrounded generation. Symptom: the model invents facts or ignores the supplied sources. Fix: constrain it to the context, require citations, and explicitly permit "I do not know."
- Stale or unfiltered data. Symptom: confidently outdated answers, or content surfaced to users who should not see it. Fix: keep the index fresh and filter retrieval by date and permission via metadata.
Notice the pattern: when a RAG answer is wrong, walk the pipeline from the front and find the earliest stage that failed. The temptation is always to reach for a bigger model at the end, but the leverage is almost always earlier — in how you chunked, searched, and grounded. RAG also rarely ships alone; it is usually one capability inside a larger agent that also calls tools and APIs, which is its own design discipline covered in how to design software and APIs for AI agents.
Game Changer Labs builds production RAG systems — the kind that stay grounded, cite their sources, and hold up under real traffic — for teams across AI, civic, and enterprise software. If you have a corpus of knowledge you want an LLM to answer from reliably, we can help you scope the pipeline, choose the stack, and ship the evals that keep it honest.
Frequently Asked Questions
What is a RAG system?
A RAG (retrieval-augmented generation) system is an application that, for each user question, searches a collection of your own documents, retrieves the most relevant passages, and feeds them to a language model as context so it answers from those sources rather than from memory. It is the standard way to give an LLM current, proprietary, or domain-specific knowledge without retraining the model.
Do I need a vector database for RAG?
Often, but not always. A vector database makes semantic search over large document collections fast and scalable, which is why it is the common default. For a small corpus you can start with an in-memory index, keyword search, or a vector extension on a database you already run. Choose the store that fits your data size, freshness, and infrastructure — the vector database is a means to good retrieval, not the goal.
How big should RAG chunks be?
There is no universal size, but a common starting range is a few hundred tokens per chunk with some overlap between neighbors. The right size depends on your content: dense technical text often wants smaller chunks, while narrative or conversational text tolerates larger ones. Chunk along natural boundaries — headings, sections, paragraphs — rather than at a fixed character count, then tune the size against your evals.
Why is my RAG system giving wrong answers?
Usually retrieval failed to surface the right passage, so the model never saw the text it needed, or the context was so cluttered that the model missed the answer inside it. Check whether the correct passage is actually retrieved for failing queries; if it is not, the problem is chunking or search, not the LLM. If it is retrieved but ignored, tighten the prompt, reduce the number of chunks, and add a reranking step.
What is an embedding in RAG?
An embedding is a list of numbers — a vector — that represents the meaning of a piece of text, produced by an embedding model. Texts with similar meaning land close together in that numeric space, which lets you find relevant passages by mathematical similarity rather than exact keyword match. RAG stores an embedding for every chunk so it can semantically search them at query time.
What is reranking in RAG?
Reranking is a second, more precise scoring pass applied to the passages your first search returned. The initial vector or keyword search is fast but approximate, so it casts a wide net; a reranker model then reads each candidate against the actual query and reorders them by true relevance. Keeping only the top reranked passages usually improves answer quality more than tuning the generator.
Is RAG better than fine-tuning?
They solve different problems, so neither is universally better. RAG injects knowledge at query time and is the right tool when the model needs fresh, proprietary, or frequently changing facts it can cite. Fine-tuning bakes behavior, tone, and output format into the model and does not reliably add new facts. Many production systems use both; for most knowledge problems, start with RAG.
How do you evaluate a RAG system?
Evaluate retrieval and generation separately. For retrieval, build a set of real questions with the passages that should answer them and measure how often the right passage is returned in the top results. For generation, measure faithfulness — whether the answer is supported by the retrieved context — usually with a mix of human review and LLM-as-judge scoring, plus answer relevance and latency.