Milton Martin
Proof of concept · Architecture exploration

Retrieval-Augmented Generation (RAG) over enterprise documents

How I design a grounded question-answering pipeline that answers from a company's own documents — with citations, guardrails against hallucination, and an evaluation loop — rather than from the model's memory alone.

Python · LangChain · Vector DB · Azure OpenAI · Embeddings · Re-ranking · Evaluation

01 The problem

Large language models are excellent at language, but they don't know your private, current, or domain-specific content, and when asked about it they can confidently make something up. In regulated domains like insurance, a wrong answer about a policy clause or a claims rule is not an inconvenience; it is a liability.

RAG closes that gap: instead of relying on what the model memorised, we retrieve the most relevant passages from a trusted knowledge base at query time and ask the model to answer only from that retrieved context, with citations back to the source.

Concrete framing. "Given our underwriting manuals and product documents, answer an underwriter's question and cite the exact sections used." This mirrors the kind of document-grounded assistance I have worked on in document-ingestion and chatbot contexts.

02 Key terms in plain English

The handful of words that everything else on this page is built from.

Embedding

A piece of text turned into a list of numbers (a vector) so that similar meanings end up close together.
Example: embed("car") ≈ embed("automobile")

Vector & cosine similarity

The vector is the text's coordinates; cosine similarity is the cosine of the angle between two vectors and scores how related they are (1 = same direction, 0 = unrelated, negative = opposing).
Example: sim = cos(q, d)

Chunk

A document split into small, self-contained pieces so retrieval returns a precise passage, not a whole 80-page PDF.
Example: chunk_size=800, overlap=120

Vector store

A database that holds the chunk vectors and finds the nearest ones to a query fast.
Example: store.search(q, k=4)

Top-k retrieval

Fetch the k most similar chunks for a query: the candidate evidence the model is allowed to use.
Example: k = 4

Re-ranking

A second, sharper model re-scores the candidates so the very best passages rise to the top before they hit the prompt.
Example: rerank(q, candidates)[:4]

Grounding & citation

Forcing the answer to come only from retrieved text, and pointing back to the source so a human can verify it.
Example prompt rule: "Answer ONLY from context. Cite [n]."

Hallucination

When a model confidently invents an answer that no source supports. This is the failure RAG is designed to reduce.

03 Minimal worked example: the whole thing, broken down

RAG stripped to its essence: no framework, no vector DB — just the five moves that make it work.

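# RAG in ~20 lines: the whole loop, no libraries beyond an embed + chat call.
# embed(text) -> vector and chat(prompt) -> str are placeholders for your
# embedding-model and chat-model calls; they are not defined in this listing.
import numpy as np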

docs = [                                            # 1. knowledge base (toy chunks)
    "Policy A-1024 renews annually on 1 March.",
    "Claims under 500 EUR are auto-approved.",
    "Flood damage is excluded from basic cover.",
]
question = "When does policy A-1024 renew?"

doc_vecs = [embed(d) for d in docs]                  # 2. embed the corpus
q_vec = embed(question)                              #    embed the query (same model!)

def cosine(a, b):                                   # 3. similarity = closeness of meaning
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]))
context = docs[best]                                 #    retrieve top-1 passage

prompt = (                                           # 4. ground the model in that passage
    "Answer ONLY from the context. If unknown, say so.\n"
    f"Context: {context}\nQuestion: {question}"
)
print(chat(prompt))                                  # 5. -> "It renews annually on 1 March."

Scaling up just swaps each toy step for its production counterpart: many chunks instead of three, a vector store instead of a Python loop, and re-ranking plus citations on top. That is the architecture in the next section.
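The worked example calls embed and chat without defining them. To run it end to end without an API key, one option is to drop in toy stand-ins first. The two functions below are hypothetical placeholders: a bag-of-words count vector (with an assumed size of 256) and a "model" that simply echoes the context line. They show the data flow only and say nothing about real embedding or answer quality.

```python
import re

import numpy as np

DIM = 256  # assumed toy vector size, not a real embedding dimension
_vocab: dict[str, int] = {}  # token -> slot, grown as new tokens appear


def embed(text: str) -> np.ndarray:
    """Toy stand-in: word-count vector, so only shared words look 'similar'."""
    vec = np.zeros(DIM)
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        slot = _vocab.setdefault(token, len(_vocab))
        vec[slot % DIM] += 1.0  # slots wrap around after DIM distinct tokens
    return vec


def chat(prompt: str) -> str:
    """Toy stand-in for an LLM call: returns the context line of the prompt."""
    for line in prompt.splitlines():
        if line.startswith("Context: "):
            return line[len("Context: "):]
    return "I don't know."
```

Defining these before the listing makes the retrieval step pick the renewal sentence for the sample question, because it is the only chunk that shares words with it. A real embedding model would also match paraphrases, which this stand-in cannot do.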

04 Architecture at a glance

Two pipelines: an offline indexing pipeline that prepares the knowledge base, and an online query pipeline that answers in real time.

Offline · Indexing pipeline: Documents (PDF · DOCX · HTML) → Parse & clean (OCR · layout) → Chunk (overlap · metadata) → Embed (embedding model) → Vector store (index + metadata)
Online · Query pipeline: User question (+ chat history) → Embed query (same model) → Retrieve top-k (+ re-rank + metadata filter) → Build prompt (context + rules) → LLM (answer + citations)
Vector store → Retrieve: indexed vectors used at query time
Indexing prepares the knowledge base once; the query pipeline runs per request.

05 Components & the decisions behind them

Parsing & cleaning

Extract clean text and structure from messy formats (PDF/DOCX/scans). Layout-aware parsing and OCR matter because tables and headings carry meaning. Bad input here caps the quality of everything downstream.

Chunking

Split documents into retrievable units. I prefer structure-aware chunks (by section/heading) with light overlap, keeping source metadata (doc id, page, section) on every chunk for citations and filtering.
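A minimal sketch of structure-aware chunking as described above, in plain Python. It assumes headings are lines that start with a section number such as "3.2 Exclusions"; that convention, the Chunk dataclass, and the chunk_by_section name are all illustrative assumptions, and real manuals will need a heading detector suited to their layout.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    doc_id: str
    section: str  # heading the chunk came from, kept for citations and filtering


# Assumed heading convention: "1 Scope", "3.2 Exclusions", ...
HEADING = re.compile(r"^\d+(\.\d+)*\s+\S.*$")


def chunk_by_section(doc_id: str, text: str,
                     max_chars: int = 800, overlap: int = 120) -> list[Chunk]:
    """Split on headings first, then window long sections with overlap."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")

    # Pass 1: group lines under the most recent heading.
    sections: list[tuple[str, list[str]]] = [("preamble", [])]
    for line in text.splitlines():
        if HEADING.match(line):
            sections.append((line.strip(), []))
        else:
            sections[-1][1].append(line)

    # Pass 2: emit one chunk per section, or overlapping windows if it is long.
    chunks: list[Chunk] = []
    step = max_chars - overlap
    for title, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue
        for start in range(0, len(body), step):
            chunks.append(Chunk(body[start:start + max_chars], doc_id, title))
            if start + max_chars >= len(body):
                break  # the window already reached the end of the section
    return chunks
```

Because every chunk carries its section title, a citation can name the section even when a long section was split into several windows. One known weakness of this heading rule is that a body line beginning with a number would be mistaken for a heading.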

Embeddings & vector store

Encode chunks into vectors and store them with their metadata. The same embedding model must be used for indexing and querying. The store provides fast approximate nearest-neighbour search.
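To make the store's job concrete, here is a sketch of exact (brute-force) top-k search with NumPy. The top_k function name and its signature are assumptions for illustration; a production vector store typically replaces this full scan with an approximate index so that it stays fast at scale.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray,
          k: int = 4) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k most similar rows.

    Assumes every vector is non-zero and that the query and the rows of
    doc_matrix came from the same embedding model.
    """
    # Normalise once so one matrix-vector product yields every cosine score.
    docs_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = docs_unit @ query_unit
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]
```

The returned row indices are what link a hit back to its stored metadata (doc id, page, section).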

Retrieval & re-ranking

Fetch top-k candidates by similarity, optionally combine with keyword search (hybrid), then re-rank with a cross-encoder so the few passages we pass to the model are genuinely the best ones.
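One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The source does not say which fusion method is used, so treat this as an assumed option rather than the author's implementation; the constant k=60 is a widely used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one fused ranking.

    Each list contributes 1 / (k + rank) per chunk, so a chunk that ranks
    well in more than one list ends up ahead of a chunk that appears in one.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)
```

The fused list would then go to the cross-encoder re-ranker, which makes the final cut to the few passages that reach the prompt.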

Prompt construction

Assemble retrieved context plus explicit instructions: answer only from context, cite sources, and say "I don't know" when the context is insufficient. This is the main hallucination guardrail.

Generation & citations

The LLM produces the answer grounded in context and returns the source references, so a human can verify every claim — essential for trust in a regulated setting.
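A small post-generation check can catch answers that cite nothing or cite a source that was never retrieved. The sketch below assumes citations appear as [n] markers numbered from 1, matching the prompt rule used on this page; the check_citations helper is hypothetical.

```python
import re


def check_citations(answer: str, num_sources: int) -> dict:
    """Report which [n] markers in the answer map to a retrieved source."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    valid = [n for n in cited if 1 <= n <= num_sources]
    invalid = [n for n in cited if n not in valid]
    return {
        "cited": cited,
        "valid": valid,
        "invalid": invalid,          # markers pointing outside the context
        "has_citation": bool(valid),
    }
```

This only verifies that the markers are in range. Whether the cited passage actually supports the claim is a faithfulness question and still needs a human or a separate evaluation step.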

06 Indexing: annotated snippet

Conceptual, framework-style pseudocode (LangChain-flavoured) to show the shape of the solution.

# --- Offline indexing: build the knowledge base once (refresh on change) ---
from langchain_text_splitters import RecursiveCharacterTextSplitter  # splitters live in their own package in current LangChain
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=120,          # overlap preserves context across boundaries
)
chunks = []
for doc in load_documents("./knowledge_base"):   # hypothetical loader; parse + clean upstream
    for piece in splitter.split_text(doc.text):
        chunks.append({
            "text": piece,
            "metadata": {"doc_id": doc.id, "section": doc.section, "page": doc.page},
        })

# Azure endpoint, API key and API version are assumed to come from environment variables.
embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-large")
store = FAISS.from_texts(
    texts=[c["text"] for c in chunks],
    embedding=embeddings,
    metadatas=[c["metadata"] for c in chunks],   # keep metadata for citations + filtering
)
store.save_local("./index")

07 Query: annotated snippet

# --- Online query: retrieve, ground, generate with citations ---
from langchain_openai import AzureChatOpenAI

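# `store` is the FAISS index from the indexing step; in a separate process, reload it with
# FAISS.load_local("./index", embeddings, allow_dangerous_deserialization=True).
# `query` is the user's question; `rerank` is a placeholder for a cross-encoder re-ranker.
retriever = store.as_retriever(search_kwargs={"k": 8})       # wide net, then re-rank
top_docs = rerank(query, retriever.invoke(query))[:4]          # cross-encoder keeps the best 4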

context = "\n\n".join(
    f"[{i+1}] ({d.metadata['doc_id']} p.{d.metadata['page']})\n{d.page_content}"
    for i, d in enumerate(top_docs)
)
system = (
    "Answer ONLY from the context. Cite sources as [n]. "
    "If the context is insufficient, say you don't know."      # anti-hallucination rule
)
llm = AzureChatOpenAI(temperature=0)                            # temperature 0 = more repeatable output; deployment settings omitted
answer = llm.invoke([
    {"role": "system", "content": system},
    {"role": "user", "content": f"Question: {query}\n\nContext:\n{context}"},
])
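The query snippet relies on a rerank helper that is not shown. Below is one possible shape for it, with the scoring function left pluggable. The default lexical_overlap scorer is only a placeholder so the sketch runs without a model; in practice a cross-encoder score would be passed in as score_fn. Both function names and the text_of parameter are assumptions.

```python
import re
from typing import Any, Callable, Sequence


def lexical_overlap(query: str, passage: str) -> float:
    """Placeholder scorer: share of query tokens that appear in the passage."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    p_tokens = set(re.findall(r"\w+", passage.lower()))
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(
    query: str,
    docs: Sequence[Any],
    score_fn: Callable[[str, str], float] = lexical_overlap,
    text_of: Callable[[Any], str] = lambda d: d.page_content,
) -> list[Any]:
    """Return docs sorted by descending score; ties keep retrieval order."""
    scored = [(score_fn(query, text_of(d)), i, d) for i, d in enumerate(docs)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [d for _, _, d in scored]
```

Keeping the scorer as a parameter means the retrieval code does not change when the placeholder is replaced by a real re-ranking model.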

08 Key design trade-offs

Decision | Why it matters
Chunk size & overlap | Small chunks = precise retrieval but lost context; large chunks = more context but noisier matches and higher token cost. Tuned per corpus.
Pure vector vs. hybrid search | Vector search captures meaning; keyword search nails exact terms (codes, IDs, clause numbers). Hybrid + re-rank usually wins in document domains.
top-k size | More passages improve recall but dilute the prompt and cost more. I retrieve wide, then re-rank down to a few high-quality passages.
Strict grounding | Forcing "answer only from context / say I don't know" trades a bit of helpfulness for a large gain in trust, which is the right call in regulated domains.
Index refresh strategy | Documents change. Incremental re-indexing keeps answers current without rebuilding the whole store.
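Incremental re-indexing needs a cheap way to tell which documents changed. A minimal sketch, assuming a stored map of document id to content hash from the previous run; the plan_reindex name and the shape of its inputs are illustrative, not taken from the pipeline above.

```python
import hashlib


def plan_reindex(
    previous: dict[str, str], current_docs: dict[str, str]
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare stored content hashes with the current text of each document.

    previous:     doc_id -> sha256 hex digest saved after the last index run
    current_docs: doc_id -> current cleaned text
    Returns (ids to re-embed and upsert, ids to delete, hashes to save).
    """
    new_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    to_upsert = sorted(d for d, h in new_hashes.items() if previous.get(d) != h)
    to_delete = sorted(d for d in previous if d not in new_hashes)
    return to_upsert, to_delete, new_hashes
```

Only the documents in to_upsert are re-chunked and re-embedded, and chunks belonging to to_delete are removed, so an unchanged corpus costs no embedding calls.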

09 Evaluation & LLMOps

A RAG system is only trustworthy if it is measured. I separate retrieval quality from generation quality:

Retrieval

Did we fetch the right passages? Track hit-rate / recall@k and context precision on a labelled question set.

Generation

Faithfulness (is every claim supported by context?), answer relevance, and citation correctness.

Operations

Latency, token cost per query, and feedback capture, with versioned prompts and indexes for reproducibility.

Why this split matters: when an answer is wrong, the eval tells me whether retrieval missed the passage or the model ignored it — two very different fixes.
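The retrieval-side metrics can be computed with a few lines once a labelled question set exists. This sketch assumes each labelled row holds the retrieved chunk ids in rank order and the set of chunk ids a human marked as relevant. "Context precision" is taken here in its simplest form (share of retrieved chunks that are relevant), which may differ from the definition used by a given evaluation library.

```python
def evaluate_retrieval(labelled: list[dict], k: int = 4) -> dict[str, float]:
    """Average hit-rate, recall@k and context precision over a labelled set.

    Each row is assumed to look like:
        {"retrieved": ["c1", "c7", ...], "relevant": {"c1", "c2"}}
    """
    if not labelled:
        raise ValueError("labelled set is empty")

    hit = recall = precision = 0.0
    for row in labelled:
        top = row["retrieved"][:k]
        relevant = set(row["relevant"])
        found = [cid for cid in top if cid in relevant]
        hit += 1.0 if found else 0.0                      # any relevant chunk in top-k
        recall += len(set(found)) / len(relevant) if relevant else 0.0
        precision += len(found) / len(top) if top else 0.0
    n = len(labelled)
    return {
        "hit_rate": hit / n,
        "recall_at_k": recall / n,
        "context_precision": precision / n,
    }
```

A low recall_at_k points at retrieval (chunking, embeddings, k), while good retrieval scores alongside wrong answers point at the prompt or the model.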

10 Limitations & where this goes next