RAG Pipeline — Architecture Case Study

01 The problem

Large language models are excellent at language, but they don't know your private, current, or domain-specific content, and when asked about it they may confidently make something up. In regulated domains like insurance, a wrong answer about a policy clause or a claims rule is not an inconvenience; it is a liability.

RAG closes that gap: instead of relying on what the model memorised, we retrieve the most relevant passages from a trusted knowledge base at query time and ask the model to answer only from that retrieved context, with citations back to the source.

Concrete framing. "Given our underwriting manuals and product documents, answer an underwriter's question and cite the exact sections used." This mirrors the kind of document-grounded assistance I have worked on in document-ingestion and chatbot contexts.

02 Key terms in plain English

The handful of words that everything else on this page is built from.

Embedding

A piece of text turned into a list of numbers (a vector) so that similar meanings end up close together. Example: embed("car") ≈ embed("automobile")

Vector & cosine similarity

The vector is the text's coordinates; cosine similarity is the cosine of the angle between two vectors and scores how related they are (1 = same direction, 0 = unrelated, negative = pointing in opposite directions). Example: sim = cos(q, d)
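
To make the score concrete, here is a minimal sketch of cosine similarity on hand-made 2-D vectors. The vectors are illustrative stand-ins, not real embeddings, which typically have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q = [1.0, 0.0]
print(cosine(q, [2.0, 0.0]))    # same direction, different length -> 1.0
print(cosine(q, [0.0, 3.0]))    # orthogonal -> 0.0
print(cosine(q, [-1.0, 0.0]))   # opposite direction -> -1.0
```

Note that vector length does not affect the score: only direction matters.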

Chunk

One of the small, self-contained pieces a document is split into, so retrieval returns a precise passage rather than a whole 80-page PDF. Example (sizes in characters): chunk_size=800, overlap=120
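
As a rough illustration of what size and overlap mean, the sketch below slices a string into fixed-size character windows. The function name chunk_text is hypothetical, and a real splitter would also try to break on paragraph or sentence boundaries rather than mid-word.

```python
def chunk_text(text, chunk_size=800, overlap=120):
    """Split text into character windows; each window repeats the last `overlap` chars of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):   # this window already reached the end
            break
    return chunks

sample = "".join(str(i % 10) for i in range(2000))   # 2000 characters of dummy text
pieces = chunk_text(sample)
print([len(p) for p in pieces])                      # [800, 800, 640]
```

With these settings each new chunk starts 680 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk most of the time.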

Vector store

A database that holds the chunk vectors and finds the nearest ones to a query fast. Example: store.search(q, k=4)
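
The idea can be sketched with a tiny in-memory store that does exact, brute-force search. The class TinyVectorStore is hypothetical and only mirrors the store.search(q, k) shape used on this page; production stores replace the full scan with approximate nearest-neighbour indexes.

```python
import numpy as np

class TinyVectorStore:
    """Toy vector store: keeps unit-length vectors and scans all of them on search."""

    def __init__(self):
        self.vectors = []
        self.payloads = []

    def add(self, vector, payload):
        v = np.asarray(vector, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))   # normalise so dot product == cosine
        self.payloads.append(payload)

    def search(self, query_vector, k=4):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q          # cosine score against every stored vector
        order = np.argsort(-scores)[:k]              # indices of the k highest scores
        return [(self.payloads[i], float(scores[i])) for i in order]

store = TinyVectorStore()
store.add([1.0, 0.0], "a")
store.add([0.0, 1.0], "b")
store.add([1.0, 1.0], "c")
hits = store.search([1.0, 0.1], k=2)
print([payload for payload, _ in hits])              # ['a', 'c']
```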

Top-k retrieval

Fetch the k most similar chunks for a query: the candidate evidence the model is allowed to use. Example: k = 4

Re-ranking

A second, more precise model re-scores the candidates so the best passages rise to the top before they reach the prompt. Example: rerank(q, candidates)[:4]
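
The shape of a re-ranking step can be sketched independently of the model that does the scoring. In the sketch below, score_fn is a pluggable assumption: in practice it would wrap a cross-encoder, while overlap_score is only a crude word-overlap stand-in so the example runs without any model.

```python
def rerank(query, candidates, score_fn, top_n=4):
    """Re-score candidates with score_fn(query, passage) and keep the top_n best."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query, passage):
    """Stand-in scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

candidates = [
    "Claims under 500 EUR are auto-approved.",
    "Flood damage is excluded from basic cover.",
    "Policy A-1024 renews annually on 1 March.",
]
best = rerank("flood damage cover", candidates, overlap_score, top_n=1)
print(best[0])   # Flood damage is excluded from basic cover.
```

Keeping the scorer as a parameter means the first-stage retriever and the re-ranker can be swapped or evaluated separately.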

Grounding & citation

Forcing the answer to come only from retrieved text, and pointing back to the source so a human can verify it. Example instruction: "Answer ONLY from context. Cite [n]."

Hallucination

When a model confidently invents an answer that no source supports. This is the failure RAG is built to reduce.

03 Minimal worked example: the whole thing, broken down

RAG stripped to its essence: no framework, no vector DB — just the five moves that make it work.

1. Knowledge base. A few trusted text snippets (in real life: your chunked documents).
2. Embed. Turn every snippet and the question into vectors with the same model.
3. Retrieve. Score each snippet against the question by cosine similarity, keep the top one.
4. Ground. Put the retrieved snippet into the prompt and instruct the model to answer only from it.
5. Answer. The model responds using the evidence, not its memory.

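# RAG in ~20 lines: the whole loop, no libraries beyond an embed + chat call.
# embed(text) -> vector and chat(prompt) -> str are placeholders for your
# embedding and chat-completion calls; they are not defined in this snippet.
import numpy as np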

docs = [                                            # 1. knowledge base (toy chunks)
    "Policy A-1024 renews annually on 1 March.",
    "Claims under 500 EUR are auto-approved.",
    "Flood damage is excluded from basic cover.",
]
question = "When does policy A-1024 renew?"

doc_vecs = [embed(d) for d in docs]                  # 2. embed the corpus
q_vec = embed(question)                              #    embed the query (same model!)

def cosine(a, b):                                   # 3. similarity = closeness of meaning
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]))
context = docs[best]                                 #    retrieve top-1 passage

prompt = (                                           # 4. ground the model in that passage
    "Answer ONLY from the context. If unknown, say so.\n"
    f"Context: {context}\nQuestion: {question}"
)
print(chat(prompt))                                  # 5. -> "It renews annually on 1 March."

Scaling up just swaps each toy step for its production counterpart: many chunks instead of three, a vector store instead of a Python loop, and re-ranking + citations on top — the architecture in the next section.
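
The loop above leaves embed and chat undefined. To try the retrieval half without any API, one option is to swap in a toy bag-of-words embed, as sketched below. This stand-in is an assumption for illustration only: it matches on shared words rather than meaning, so it is not a substitute for a real embedding model, and the chat step is left out.

```python
import numpy as np

docs = [
    "Policy A-1024 renews annually on 1 March.",
    "Claims under 500 EUR are auto-approved.",
    "Flood damage is excluded from basic cover.",
]
question = "When does policy A-1024 renew?"

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

vocab = sorted({w for d in docs for w in tokenize(d)})

def embed(text):
    """Toy embedding: word counts over the corpus vocabulary (NOT a semantic model)."""
    words = tokenize(text)
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

q_vec = embed(question)
scores = [cosine(q_vec, embed(d)) for d in docs]
best = int(np.argmax(scores))
context = docs[best]
print(context)   # Policy A-1024 renews annually on 1 March.
```

Only the first snippet shares words with the question ("policy", "a-1024"), so it wins; a real embedding model would also match paraphrases that share no words.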

05 Components & the decisions behind them

Parsing & cleaning

Extract clean text and structure from messy formats (PDF/DOCX/scans). Layout-aware parsing and OCR matter because tables and headings carry meaning. Bad input here caps the quality of everything downstream.
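
As a small example of the cleaning step, the sketch below normalises text that has already been extracted. The function name and the specific rules are assumptions; it does not cover layout-aware parsing or OCR, and the de-hyphenation rule is a heuristic that can wrongly join genuinely hyphenated words split across lines.

```python
import re

def clean_extracted_text(raw):
    """Normalise whitespace and re-join words hyphenated across line breaks."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # "Under-\nwriting" -> "Underwriting"
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)           # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)         # at most one blank line in a row
    return text.strip()

raw = "Under-\nwriting  manual\r\n\r\n\r\n\r\nSection 1 \n covers   eligibility."
print(repr(clean_extracted_text(raw)))
# 'Underwriting manual\n\nSection 1\ncovers eligibility.'
```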

Chunking

Split documents into retrievable units. I prefer structure-aware chunks (by section/heading) with light overlap, keeping source metadata (doc id, page, section) on every chunk for citations and filtering.
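
A minimal sketch of structure-aware chunking follows. It assumes, purely for illustration, that headings have been normalised upstream to lines starting with "#"; real manuals would need a parser that recognises their own heading styles. Overlap and page numbers are omitted to keep it short.

```python
def chunk_by_heading(text, doc_id):
    """Emit one chunk per section, carrying doc id and section title as metadata."""
    chunks, section, buffer = [], "Preamble", []
    for line in text.splitlines() + ["# <end>"]:    # sentinel heading flushes the last section
        if line.startswith("#"):
            body = "\n".join(buffer).strip()
            if body:
                chunks.append({"text": body,
                               "metadata": {"doc_id": doc_id, "section": section}})
            section, buffer = line.lstrip("#").strip(), []
        else:
            buffer.append(line)
    return chunks

manual = (
    "# 1 Eligibility\n"
    "Applicants must be over 18.\n"
    "# 2 Exclusions\n"
    "Flood damage is excluded.\n"
    "War risks are excluded.\n"
)
for chunk in chunk_by_heading(manual, doc_id="uw-manual"):
    print(chunk["metadata"]["section"], "->", repr(chunk["text"]))
```

Because every chunk carries its section title, a citation can point to "uw-manual, 2 Exclusions" rather than to an anonymous character offset.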

Embeddings & vector store

Encode chunks into vectors and store them with their metadata. The same embedding model must be used for indexing and querying. The store provides fast approximate nearest-neighbour search.

Retrieval & re-ranking

Fetch top-k candidates by similarity, optionally combine with keyword search (hybrid), then re-rank with a cross-encoder so the few passages we pass to the model are genuinely the best ones.
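
The text does not say how the vector and keyword result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used here. It needs only the rank of each chunk in each list, so scores from different retrievers never have to be put on the same scale. The chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c1", "c7"]     # best-first ids from similarity search
keyword_hits = ["c1", "c9", "c3"]    # best-first ids from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)                          # ['c1', 'c3', 'c9', 'c7']
```

Chunks found by both retrievers (c1, c3) rise above chunks found by only one; the fused list is then what gets passed to the re-ranker.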

Prompt construction

Assemble retrieved context plus explicit instructions: answer only from context, cite sources, and say "I don't know" when the context is insufficient. This is the main hallucination guardrail.
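
Here is a minimal sketch of that assembly step, assuming each retrieved passage is a dict with text, doc_id, and page keys (a hypothetical shape chosen for the example). It numbers the passages so the model can cite them as [n].

```python
SYSTEM_RULES = (
    "Answer ONLY from the context. Cite sources as [n]. "
    "If the context is insufficient, say you don't know."
)

def build_messages(question, passages):
    """Return chat messages with numbered, source-labelled context."""
    context = "\n\n".join(
        f"[{i}] ({p['doc_id']} p.{p['page']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": f"Question: {question}\n\nContext:\n{context}"},
    ]

passages = [
    {"doc_id": "uw-manual", "page": 12, "text": "Flood damage is excluded from basic cover."},
    {"doc_id": "product-a", "page": 3, "text": "Policy A-1024 renews annually on 1 March."},
]
messages = build_messages("Is flood damage covered?", passages)
print(messages[1]["content"])
```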

Generation & citations

The LLM produces the answer grounded in context and returns the source references, so a human can verify every claim — essential for trust in a regulated setting.
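
Since the model only emits citation markers as text, a cheap post-check can confirm that each marker points at a passage that was actually supplied. The sketch below assumes citations are written as [n]; the function name is hypothetical. It checks that references exist, not that the cited passage supports the claim.

```python
import re

def check_citations(answer, num_passages):
    """Find [n] markers in the answer and flag any that reference a passage that was not supplied."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    invalid = [n for n in cited if not 1 <= n <= num_passages]
    return {"cited": cited, "invalid": invalid, "has_citation": bool(cited)}

report = check_citations("It renews on 1 March [1]. Flood is excluded [5].", num_passages=4)
print(report)   # {'cited': [1, 5], 'invalid': [5], 'has_citation': True}
```

An answer with no citations, or with invalid ones, can be rejected or routed to a human before it reaches the user.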

06 Indexing: annotated snippet

Conceptual, framework-style pseudocode (LangChain-flavoured) to show the shape of the solution.

# --- Offline indexing: build the knowledge base once (refresh on change) ---
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=120,          # overlap preserves context across boundaries
)
chunks = []
for doc in load_documents("./knowledge_base"):   # placeholder loader (not a LangChain API); parse + clean upstream
    for piece in splitter.split_text(doc.text):
        chunks.append({
            "text": piece,
            "metadata": {"doc_id": doc.id, "section": doc.section, "page": doc.page},
        })

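# Endpoint, API key and API version are expected in environment variables;
# on Azure you typically also pass azure_deployment=<your deployment name>.
embeddings = AzureOpenAIEmbeddings(model="text-embedding-3-large")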
store = FAISS.from_texts(
    texts=[c["text"] for c in chunks],
    embedding=embeddings,
    metadatas=[c["metadata"] for c in chunks],   # keep metadata for citations + filtering
)
store.save_local("./index")

07 Query: annotated snippet

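# --- Online query: retrieve, ground, generate with citations ---
# Assumes `store` is the FAISS index built in the indexing snippet and `query` is
# the user's question. rerank(query, docs) is a placeholder for a cross-encoder
# re-ranker that returns docs sorted best-first; it is not a LangChain function.
from langchain_openai import AzureChatOpenAI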

retriever = store.as_retriever(search_kwargs={"k": 8})       # wide net, then re-rank
top_docs = rerank(query, retriever.invoke(query))[:4]          # cross-encoder keeps the best 4

context = "\n\n".join(
    f"[{i+1}] ({d.metadata['doc_id']} p.{d.metadata['page']})\n{d.page_content}"
    for i, d in enumerate(top_docs)
)
system = (
    "Answer ONLY from the context. Cite sources as [n]. "
    "If the context is insufficient, say you don't know."      # anti-hallucination rule
)
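# temperature=0 reduces sampling randomness for more repeatable answers; it does not guarantee correctness.
# On Azure you typically also pass azure_deployment and an API version (or set them via environment variables).
llm = AzureChatOpenAI(temperature=0)
answer = llm.invoke([
    {"role": "system", "content": system},
    {"role": "user", "content": f"Question: {query}\n\nContext:\n{context}"},
])
print(answer.content)                                           # the reply is a message object; .content holds the text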

08 Key design trade-offs

Decision	Why it matters
Chunk size & overlap	Small chunks = precise retrieval but lost context; large chunks = more context but noisier matches and higher token cost. Tuned per corpus.
Pure vector vs. hybrid search	Vector search captures meaning; keyword search nails exact terms (codes, IDs, clause numbers). Hybrid + re-rank usually wins in document domains.
top-k size	More passages improve recall but dilute the prompt and cost more. I retrieve wide, then re-rank down to a few high-quality passages.
Strict grounding	Forcing "answer only from context / say I don't know" trades a bit of helpfulness for a large gain in trust — the right call in regulated domains.
Index refresh strategy	Documents change. Incremental re-indexing keeps answers current without rebuilding the whole store.
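
One way to decide what an incremental refresh has to touch is to compare content hashes, sketched below. The function names and the dict-based bookkeeping are assumptions for illustration; the actual upsert and delete calls depend on the vector store in use.

```python
import hashlib

def fingerprint(text):
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, indexed_hashes):
    """Compare current documents with what was indexed; return (ids to re-embed, ids to delete)."""
    to_upsert = [doc_id for doc_id, text in current_docs.items()
                 if indexed_hashes.get(doc_id) != fingerprint(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_upsert, to_delete

indexed = {"a": fingerprint("old clause"), "b": fingerprint("unchanged"), "c": fingerprint("retired")}
current = {"a": "new clause", "b": "unchanged", "d": "brand new document"}
print(plan_reindex(current, indexed))   # (['a', 'd'], ['c'])
```

Only changed or new documents are re-chunked and re-embedded, and chunks of removed documents are deleted so stale text cannot be retrieved.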

09 Evaluation & LLMOps

A RAG system is only trustworthy if it is measured. I separate retrieval quality from generation quality:

Retrieval

Did we fetch the right passages? Track hit-rate / recall@k and context precision on a labelled question set.
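
Both retrieval metrics are simple to compute once a labelled set exists. The sketch below assumes two dicts keyed by question id: the retrieved chunk ids in rank order, and the set of chunk ids a human marked as relevant. The ids and data are made up for the example.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Share of questions with at least one relevant chunk in the top k."""
    hits = sum(1 for q in retrieved if set(retrieved[q][:k]) & relevant[q])
    return hits / len(retrieved)

def recall_at_k(retrieved, relevant, k):
    """Average share of each question's relevant chunks found in the top k."""
    per_question = [len(set(retrieved[q][:k]) & relevant[q]) / len(relevant[q])
                    for q in retrieved]
    return sum(per_question) / len(per_question)

retrieved = {"q1": ["c1", "c4", "c2"], "q2": ["c9", "c8", "c7"]}
relevant = {"q1": {"c1", "c2"}, "q2": {"c5"}}

print(hit_rate_at_k(retrieved, relevant, k=2))   # 0.5
print(recall_at_k(retrieved, relevant, k=2))     # 0.25
print(recall_at_k(retrieved, relevant, k=3))     # 0.5
```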

Generation

Faithfulness (is every claim supported by context?), answer relevance, and citation correctness.

Operations

Latency, token cost per query, and feedback capture, with versioned prompts and indexes for reproducibility.
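
A per-query record along these lines is one way to capture the operational signals. Everything here is a hypothetical shape: answer_fn stands for whatever function runs the pipeline and is assumed to return the answer text plus prompt and completion token counts, and the version labels are placeholders.

```python
import time
from dataclasses import dataclass

@dataclass
class QueryRecord:
    question: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    prompt_version: str
    index_version: str

def run_and_record(question, answer_fn, prompt_version, index_version):
    """Time one pipeline call and keep the versions needed to reproduce it."""
    start = time.perf_counter()
    answer, prompt_tokens, completion_tokens = answer_fn(question)
    latency = time.perf_counter() - start
    record = QueryRecord(question, latency, prompt_tokens, completion_tokens,
                         prompt_version, index_version)
    return answer, record

def fake_answer_fn(question):
    # Stand-in for the real pipeline so the sketch runs offline.
    return "It renews annually on 1 March. [1]", 120, 12

answer, record = run_and_record("When does policy A-1024 renew?", fake_answer_fn,
                                prompt_version="prompt-v1", index_version="index-v1")
print(record)
```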

Why this split matters: when an answer is wrong, the eval tells me whether retrieval missed the passage or the model ignored it — two very different fixes.

10 Limitations & where this goes next

Honest scope: this is an architecture / PoC-level walkthrough to show how I reason about RAG, not a production deployment.
Quality ceiling = data quality: parsing and chunking dominate outcomes; tables and scanned forms need extra care.
Next step — agentic RAG: let the system decide when to retrieve, reformulate weak queries, and verify its own answer — covered in my LangGraph orchestration and agentic systems case studies.
Guardrails: add PII handling, access control on the index, and prompt-injection defenses before real use.
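
As one illustration of access control on the index, retrieved chunks can be filtered against the caller's permissions before they reach the prompt. The allowed_groups metadata field is an assumption made for this sketch, and chunks without it are dropped so that the default is to deny. Where the vector store supports metadata filters, applying the same rule inside the search is preferable.

```python
def filter_by_access(chunks, user_groups):
    """Keep only chunks whose allowed_groups overlap the user's groups (deny by default)."""
    user_groups = set(user_groups)
    return [c for c in chunks
            if user_groups & set(c["metadata"].get("allowed_groups", []))]

chunks = [
    {"text": "Underwriting limits...", "metadata": {"allowed_groups": ["underwriting"]}},
    {"text": "Claims fraud rules...", "metadata": {"allowed_groups": ["claims", "audit"]}},
    {"text": "Untagged draft...", "metadata": {}},
]
visible = filter_by_access(chunks, user_groups=["underwriting"])
print([c["text"] for c in visible])   # ['Underwriting limits...']
```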
