Moving beyond a single prompt: how I structure multi-step LLM applications as an explicit, stateful graph — with branching, tool calls, retries, checkpointing and a human-in-the-loop — so the system is controllable, debuggable and safe to run in production.
A single prompt works for a single turn. Real tasks need several steps — classify, retrieve, call a tool,
validate, maybe loop back and try again. Wiring that as ad-hoc if/else around chained
prompts becomes brittle fast: there is no shared state, no clean way to retry one step, and no visibility into
why a run went the way it did.
LangGraph models the workflow as a directed graph of nodes (units of work) and edges (what happens next), all operating over a shared, typed state. That makes the flow explicit, inspectable, resumable, and testable — the difference between a demo and something you can operate.
The vocabulary of a LangGraph workflow, before any code.
State: one typed object that travels through the whole workflow; every step reads and updates it instead of using hidden globals. Example: `class State(TypedDict): ...`
Node: a plain function that does one job (classify, retrieve, generate) and returns the updated state. Example: `def generate(state): ...`
Edge: a fixed "after A, do B" connection between two nodes. Example: `g.add_edge("retrieve", "generate")`
Conditional edge: routing that looks at the state to choose the next node; this is where branching and loops live. Example: `g.add_conditional_edges("grade", route)`
Tool node: a node that calls an external system (search, API, calculator) and feeds the result back into state.
Checkpointer: saves the state after each step so a run can pause, resume, or recover, and so a human can step in mid-flow. Example: `g.compile(checkpointer=memory)`
Human-in-the-loop: an interrupt where a person approves or corrects before the graph continues; reserved for high-stakes steps.
START / END: the graph's entry and exit markers; routing to END finishes the run. Example: `g.add_edge(START, "retrieve")`
The smallest useful graph: generate, self-check, and loop back a bounded number of times.
# The smallest self-correcting graph: generate -> check -> (loop or stop)
# llm_answer and is_good are placeholders for your own model call and grader.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class State(TypedDict):                 # 1. shared state
    question: str
    draft: str
    ok: bool
    attempts: int

def generate(state):                    # 2a. node: produce a draft
    state["draft"] = llm_answer(state["question"])
    state["attempts"] += 1
    return state

def check(state):                       # 2b. node: self-grade the draft
    state["ok"] = is_good(state["draft"])
    return state

def route(state):                       # 3. conditional edge = control logic
    if state["ok"]:
        return END                      # good enough -> stop
    if state["attempts"] < 3:
        return "generate"               # retry (bounded!)
    return END                          # give up gracefully

g = StateGraph(State)                   # 4. wire it together
g.add_node("generate", generate)
g.add_node("check", check)
g.add_edge(START, "generate")
g.add_edge("generate", "check")
g.add_conditional_edges("check", route)
app = g.compile()
app.invoke({"question": "Is flood damage covered?", "attempts": 0})
The retry cap (`attempts < 3`) is the whole point: self-correction without a
budget can loop forever and burn tokens. The fuller flow below adds retrieval and a human escalation path.
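To make the retry cap concrete, here is a minimal plain-Python sketch that drives the same generate, check, route cycle by hand against a checker that never passes. It does not use LangGraph: `END` is a stand-in sentinel, `MAX_ATTEMPTS` and `run_with_failing_checker` are names assumed for illustration, and the "generate" and "check" steps are faked inline.

```python
from typing import TypedDict

MAX_ATTEMPTS = 3   # assumed budget, matching the `attempts < 3` check in the graph
END = "__end__"    # stand-in sentinel so the sketch runs without LangGraph installed


class State(TypedDict):
    question: str
    draft: str
    ok: bool
    attempts: int


def route(state: State) -> str:
    """Same control logic as the graph's conditional edge."""
    if state["ok"]:
        return END                 # good enough: stop
    if state["attempts"] < MAX_ATTEMPTS:
        return "generate"          # retry, still within budget
    return END                     # budget spent: give up gracefully


def run_with_failing_checker(question: str) -> State:
    """Drive generate -> check -> route by hand with a checker that always rejects."""
    state: State = {"question": question, "draft": "", "ok": False, "attempts": 0}
    next_node = "generate"
    while next_node != END:
        state["draft"] = f"draft #{state['attempts'] + 1}"   # fake generate step
        state["attempts"] += 1
        state["ok"] = False                                  # fake check step
        next_node = route(state)
    return state


final = run_with_failing_checker("Is flood damage covered?")
print(final["attempts"])   # worst case: the loop stops once the budget is spent
```

Because `route` is a pure function of the state, the worst case can be pinned down in a unit test without calling a model at all.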
A graph that answers a question over documents, checks whether it actually has enough grounding, and loops to refine the query if not — escalating to a human when confidence stays low.
Shared state: a single typed object passed between nodes (question, retrieved context, draft, attempt count, confidence). Every node reads and updates it, so there are no hidden globals.
Nodes: plain functions that each do one thing (classify, retrieve, generate, grade), which makes them easy to unit-test in isolation.
Conditional edges: routing logic that inspects the state to decide the next node; this is where branching, looping and escalation live.
Tool nodes: nodes that call external tools and APIs (search, calculators, internal systems) and feed the results back into state.
Checkpointing: persist state after each step so a run can pause, resume, or recover from a crash, and so a human can step in mid-flow.
Human-in-the-loop: an interrupt point where a person approves or corrects before the graph continues; vital for high-stakes actions.
Conceptual LangGraph-style pseudocode showing state, nodes and the conditional loop.
from typing import List, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END

# search, llm_answer, is_supported and human_review are placeholders to supply.

class State(TypedDict):  # shared, typed state flows through every node
    question: str
    context: List[str]
    draft: str
    attempts: int
    grounded: bool

def retrieve(state: State) -> State:
    # tool/retriever call; a real flow would refine the query on retries
    state["context"] = search(state["question"])
    return state

def generate(state: State) -> State:
    state["draft"] = llm_answer(state["question"], state["context"])
    return state

def grade(state: State) -> State:
    state["grounded"] = is_supported(state["draft"], state["context"])  # self-check
    state["attempts"] += 1
    return state

def route(state: State) -> str:  # conditional edge = the control logic
    if state["grounded"]:
        return "answer"
    if state["attempts"] < 3:
        return "retrieve"  # loop back and refine
    return "human"  # give up gracefully → escalate

g = StateGraph(State)
g.add_node("retrieve", retrieve)
g.add_node("generate", generate)
g.add_node("grade", grade)
g.add_node("human", human_review)
g.add_edge(START, "retrieve")
g.add_edge("retrieve", "generate")
g.add_edge("generate", "grade")
g.add_conditional_edges("grade", route, {"answer": END, "retrieve": "retrieve", "human": "human"})
g.add_edge("human", END)  # finish once the reviewer has stepped in

memory = MemorySaver()  # in-memory checkpointer; swap for a persistent one in production
app = g.compile(checkpointer=memory)  # checkpointing → resumable + human-in-the-loop
# With a checkpointer, each run needs a thread id, e.g.
# app.invoke(inputs, config={"configurable": {"thread_id": "demo-1"}})
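What checkpointing buys you is easier to see in a stripped-down form. The sketch below is plain Python, not LangGraph's checkpointer: `InMemoryCheckpoints`, `run`, `pause_before` and `human_input` are all hypothetical names, and the three steps are toy stand-ins. It only illustrates the idea of saving state after each step, pausing before a gated step, and resuming without redoing earlier work.

```python
import copy
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Step = Tuple[str, Callable[[State], State]]


class InMemoryCheckpoints:
    """Hypothetical store: one snapshot per completed step, keyed by thread id."""

    def __init__(self) -> None:
        self._saved: Dict[str, List[Tuple[int, State]]] = {}

    def save(self, thread_id: str, step_index: int, state: State) -> None:
        self._saved.setdefault(thread_id, []).append((step_index, copy.deepcopy(state)))

    def latest(self, thread_id: str) -> Optional[Tuple[int, State]]:
        history = self._saved.get(thread_id)
        if not history:
            return None
        index, state = history[-1]
        return index, copy.deepcopy(state)


def run(steps: List[Step], store: InMemoryCheckpoints, thread_id: str, initial: State,
        pause_before: Optional[str] = None,
        human_input: Optional[State] = None) -> Tuple[State, Optional[str]]:
    """Run steps in order, checkpointing after each; resume from the last checkpoint."""
    saved = store.latest(thread_id)
    if saved is None:
        next_index, state = 0, dict(initial)
    else:
        last_done, state = saved
        next_index = last_done + 1
    if human_input:
        state.update(human_input)            # reviewer's approval or correction
    for i in range(next_index, len(steps)):
        name, fn = steps[i]
        if name == pause_before and "approved" not in state:
            return state, name               # paused: waiting for a human
        state = fn(state)
        store.save(thread_id, i, state)
    return state, None


calls: List[str] = []                        # records which steps actually ran

def retrieve(s: State) -> State:
    calls.append("retrieve")
    return {**s, "context": ["policy.pdf"]}

def generate(s: State) -> State:
    calls.append("generate")
    return {**s, "draft": "Flood damage is covered."}

def send(s: State) -> State:
    calls.append("send")
    return {**s, "sent": bool(s["approved"])}

steps: List[Step] = [("retrieve", retrieve), ("generate", generate), ("send", send)]
store = InMemoryCheckpoints()
question = {"question": "Is flood damage covered?"}

paused_state, paused_at = run(steps, store, "t1", question, pause_before="send")
calls_at_pause = list(calls)
final_state, still_paused = run(steps, store, "t1", question, pause_before="send",
                                human_input={"approved": True})
```

The second call picks up at the gated step: the earlier steps are not executed again, which is the property that makes long or expensive flows safe to interrupt.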
| Decision | Why it matters |
|---|---|
| Explicit graph vs. a single mega-prompt | More upfront structure, but each step becomes testable, retryable and observable — far easier to operate and debug. |
| Bounded loops (max attempts) | Self-correction is powerful but can spin forever or burn tokens. A hard cap plus graceful escalation keeps cost and latency predictable. |
| Where to put the human | Too many approvals kill throughput; too few are risky. I gate only high-stakes or low-confidence steps. |
| Checkpoint granularity | Per-node checkpoints enable resume and audit, at some storage cost — worth it for long or critical flows. |
| Graph vs. autonomous agent | A fixed graph is predictable; a free agent is flexible. I reach for a graph when the steps are known, and lean agentic when they are not — see the agentic systems study. |
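The "where to put the human" row can be expressed as a small routing policy. This is a sketch under stated assumptions: the action names in `HIGH_STAKES_ACTIONS`, the `MIN_CONFIDENCE` threshold, and the `Step` shape are all hypothetical and would need tuning against real escalation data.

```python
from dataclasses import dataclass

HIGH_STAKES_ACTIONS = {"issue_refund", "delete_account"}   # hypothetical action names
MIN_CONFIDENCE = 0.7                                       # assumed threshold


@dataclass
class Step:
    action: str
    confidence: float   # 0.0 to 1.0, from whatever grader the flow uses


def needs_human(step: Step) -> bool:
    """Gate only high-stakes or low-confidence steps; let everything else through."""
    return step.action in HIGH_STAKES_ACTIONS or step.confidence < MIN_CONFIDENCE


def route_after_grade(step: Step) -> str:
    """Return the name of the next node, in the style of a conditional edge."""
    return "human" if needs_human(step) else "answer"


print(route_after_grade(Step("answer_faq", 0.95)))
print(route_after_grade(Step("answer_faq", 0.40)))
print(route_after_grade(Step("issue_refund", 0.99)))
```

Keeping the policy in one function means the approval rate can be adjusted, and tested, without touching the graph wiring.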
Tracing: every node, prompt, tool call and state transition is traced, so a run reads like a story you can replay.
Evaluation: test the graph end-to-end and node-by-node against fixed cases; track success rate, loop counts and escalation rate.
Cost and latency: measure tokens and time per node to find the expensive step before scaling out.
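A minimal sketch of how per-node tracing, cost accounting and evaluation metrics could be collected in plain Python. Nothing here is a LangGraph or tracing-vendor API: `traced`, `cost_by_node`, `eval_summary`, the `tokens_total` state key and the shape of the run records are assumptions made for illustration, and the token figures are invented sample values.

```python
import time
from collections import defaultdict
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
TRACE: List[Dict[str, Any]] = []   # one event per node call; a real system would export these


def traced(name: str, node: Callable[[State], State]) -> Callable[[State], State]:
    """Wrap a node so each call records its duration and token usage."""
    def wrapper(state: State) -> State:
        tokens_before = state.get("tokens_total", 0)   # assumed cumulative counter in state
        started = time.perf_counter()
        result = node(state)
        TRACE.append({
            "node": name,
            "seconds": time.perf_counter() - started,
            "tokens": result.get("tokens_total", 0) - tokens_before,
        })
        return result
    return wrapper


def cost_by_node(trace: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate calls, seconds and tokens per node to find the expensive step."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "seconds": 0.0, "tokens": 0})
    for event in trace:
        row = totals[event["node"]]
        row["calls"] += 1
        row["seconds"] += event["seconds"]
        row["tokens"] += event["tokens"]
    return dict(totals)


def eval_summary(runs: List[Dict[str, Any]]) -> Dict[str, float]:
    """runs: one record per fixed test case, with 'passed', 'loops' and 'escalated'."""
    if not runs:
        raise ValueError("need at least one run to summarise")
    n = len(runs)
    return {
        "success_rate": sum(r["passed"] for r in runs) / n,
        "mean_loops": sum(r["loops"] for r in runs) / n,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
    }


# Toy nodes with invented token counts, standing in for real retrieval and generation.
def retrieve(s: State) -> State:
    return {**s, "context": ["policy.pdf"]}

def generate(s: State) -> State:
    return {**s, "draft": "...", "tokens_total": s.get("tokens_total", 0) + 120}

state: State = {"question": "Is flood damage covered?"}
state = traced("retrieve", retrieve)(state)
state = traced("generate", generate)(state)
state = traced("generate", generate)(state)   # one retry
costs = cost_by_node(TRACE)

summary = eval_summary([
    {"passed": True, "loops": 1, "escalated": False},
    {"passed": True, "loops": 2, "escalated": False},
    {"passed": False, "loops": 3, "escalated": True},
    {"passed": True, "loops": 2, "escalated": False},
])
```

Wrapping nodes rather than editing them keeps the measurement separate from the work, so the same wrapper can be applied to every node when the graph is assembled.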