How I design a grounded question-answering pipeline that answers from a company's own documents — with citations, guardrails against hallucination, and an evaluation loop — rather than from the model's memory alone.
Large language models are excellent at language but they don't know your private, current, or domain-specific content — and when asked, they will confidently make something up. In regulated domains like insurance, a wrong answer about a policy clause or a claims rule is not an inconvenience, it is a liability.
RAG closes that gap: instead of relying on what the model memorised, we retrieve the most relevant passages from a trusted knowledge base at query time and ask the model to answer only from that retrieved context, with citations back to the source.
The handful of words that everything else in this page is built from.
A piece of text turned into a list of numbers (a vector) so that similar meanings end up close together. embed("car") ≈ embed("automobile")
The vector is the text's coordinates; cosine similarity measures the angle between two vectors to score how related they are (1 = identical, 0 = unrelated). sim = cos(q, d)
A document split into small, self-contained pieces so retrieval returns a precise passage, not a whole 80-page PDF. chunk_size=800, overlap=120
A database that holds the chunk vectors and finds the nearest ones to a query fast. store.search(q, k=4)
Fetch the k most similar chunks for a query — the candidate evidence the model is allowed to use. k = 4
A second, sharper model re-scores the candidates so the very best passages rise to the top before they hit the prompt. rerank(q, candidates)[:4]
Forcing the answer to come only from retrieved text, and pointing back to the source so a human can verify it. "Answer ONLY from context. Cite [n]."
When a model confidently invents an answer not supported by any source — the exact failure RAG is built to prevent.
RAG stripped to its essence: no framework, no vector DB — just the five moves that make it work.
# RAG in ~20 lines — the whole loop, no libraries beyond an embed + chat call import numpy as np docs = [ # 1. knowledge base (toy chunks) "Policy A-1024 renews annually on 1 March.", "Claims under 500 EUR are auto-approved.", "Flood damage is excluded from basic cover.", ] question = "When does policy A-1024 renew?" doc_vecs = [embed(d) for d in docs] # 2. embed the corpus q_vec = embed(question) # embed the query (same model!) def cosine(a, b): # 3. similarity = closeness of meaning return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i])) context = docs[best] # retrieve top-1 passage prompt = ( # 4. ground the model in that passage "Answer ONLY from the context. If unknown, say so.\n" f"Context: {context}\nQuestion: {question}" ) print(chat(prompt)) # 5. -> "It renews annually on 1 March."
Two pipelines: an offline indexing pipeline that prepares the knowledge base, and an online query pipeline that answers in real time.
Extract clean text and structure from messy formats (PDF/DOCX/scans). Layout-aware parsing and OCR matter because tables and headings carry meaning. Bad input here caps the quality of everything downstream.
Split documents into retrievable units. I prefer structure-aware chunks (by section/heading) with light overlap, keeping source metadata (doc id, page, section) on every chunk for citations and filtering.
Encode chunks into vectors and store them with their metadata. The same embedding model must be used for indexing and querying. The store provides fast approximate nearest-neighbour search.
Fetch top-k candidates by similarity, optionally combine with keyword search (hybrid), then re-rank with a cross-encoder so the few passages we pass to the model are genuinely the best ones.
Assemble retrieved context plus explicit instructions: answer only from context, cite sources, and say "I don't know" when the context is insufficient. This is the main hallucination guardrail.
The LLM produces the answer grounded in context and returns the source references, so a human can verify every claim — essential for trust in a regulated setting.
Conceptual, framework-style pseudocode (LangChain-flavoured) to show the shape of the solution.
# --- Offline indexing: build the knowledge base once (refresh on change) --- from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain_openai import AzureOpenAIEmbeddings from langchain_community.vectorstores import FAISS splitter = RecursiveCharacterTextSplitter( chunk_size=800, chunk_overlap=120, # overlap preserves context across boundaries ) chunks = [] for doc in load_documents("./knowledge_base"): # parse + clean upstream for piece in splitter.split_text(doc.text): chunks.append({ "text": piece, "metadata": {"doc_id": doc.id, "section": doc.section, "page": doc.page}, }) embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-large") store = FAISS.from_texts( texts=[c["text"] for c in chunks], embedding=embeddings, metadatas=[c["metadata"] for c in chunks], # keep metadata for citations + filtering ) store.save_local("./index")
# --- Online query: retrieve, ground, generate with citations --- from langchain_openai import AzureChatOpenAI retriever = store.as_retriever(search_kwargs={"k": 8}) # wide net, then re-rank top_docs = rerank(query, retriever.invoke(query))[:4] # cross-encoder keeps the best 4 context = "\n\n".join( f"[{i+1}] ({d.metadata['doc_id']} p.{d.metadata['page']})\n{d.page_content}" for i, d in enumerate(top_docs) ) system = ( "Answer ONLY from the context. Cite sources as [n]. " "If the context is insufficient, say you don't know." # anti-hallucination rule ) llm = AzureChatOpenAI(temperature=0) # low temp = deterministic, factual answer = llm.invoke([ {"role": "system", "content": system}, {"role": "user", "content": f"Question: {query}\n\nContext:\n{context}"}, ])
| Decision | Why it matters |
|---|---|
| Chunk size & overlap | Small chunks = precise retrieval but lost context; large chunks = more context but noisier matches and higher token cost. Tuned per corpus. |
| Pure vector vs. hybrid search | Vector search captures meaning; keyword search nails exact terms (codes, IDs, clause numbers). Hybrid + re-rank usually wins in document domains. |
| top-k size | More passages improve recall but dilute the prompt and cost more. I retrieve wide, then re-rank down to a few high-quality passages. |
| Strict grounding | Forcing "answer only from context / say I don't know" trades a bit of helpfulness for a large gain in trust — the right call in regulated domains. |
| Index refresh strategy | Documents change. Incremental re-indexing keeps answers current without rebuilding the whole store. |
A RAG system is only trustworthy if it is measured. I separate retrieval quality from generation quality:
Did we fetch the right passages? Track hit-rate / recall@k and context precision on a labelled question set.
Faithfulness (is every claim supported by context?), answer relevance, and citation correctness.
Latency, token cost per query, and feedback capture, with versioned prompts and indexes for reproducibility.