Milton Martin
Proof of concept · Architecture exploration

Agentic AI: systems that plan, use tools, and check their work

When the steps to solve a task aren't known in advance, a fixed pipeline isn't enough. This study shows how I think about agents — an LLM that reasons, calls tools, keeps memory, and coordinates with other specialised agents — and, just as importantly, how to keep such a system bounded, observable and evaluable.

Tool calling · Planner / Executor · Multi-agent · MCP · Memory · Agent evaluation

01 What makes a system "agentic"

A workflow follows a path I designed. An agent decides the path itself: given a goal, it reasons about what to do, picks a tool, observes the result, and repeats until the goal is met. That autonomy is what makes agents powerful for open-ended tasks — and what makes them risky if left unbounded.

The four ingredients I look for:

Reasoning

An LLM that can break a goal into steps and decide the next action (the "think" in think–act–observe).

Tools

Functions the agent can call — search, retrieval, calculators, internal APIs — to act on the world and gather facts.

Memory

Short-term scratchpad for the current task plus longer-term memory across sessions.

Feedback loop

Observe each tool result and adjust — including self-critique to catch its own mistakes.

Honest positioning. I treat agentic AI at an exploration / PoC level today — strong on the patterns, trade-offs and guardrails, and actively building hands-on depth. This page is how I would design and reason about such a system.

02 Key terms in plain English

The building blocks behind the word "agentic".

Agent

An LLM that, given a goal, decides its own next action, runs it, looks at the result, and repeats until done — rather than following a path you hard-coded.

Tool

A typed, well-described function the agent may call to act or gather facts (search, retrieval, an internal API).
@tool
def get_policy(id): ...

ReAct loop

The core rhythm: think (decide), act (call a tool), observe (read the result), repeat.
think -> act -> observe

Planner / Executor

A supervisor breaks a goal into sub-tasks and delegates them to focused worker agents, then assembles the results.

Critic

A verifier agent that scores or rejects a result before it is finalised — the quality gate of a multi-agent system.

Memory

Short-term scratchpad for the current task plus longer-term memory carried across sessions.

MCP

Model Context Protocol — a standard way to expose tools and data so any MCP-aware agent can discover and call them, instead of bespoke glue per integration.

Stopping condition

Explicit "done" criteria plus a step/token budget so the agent can't loop forever.
while steps < MAX:

03 Minimal worked example: the whole thing, broken down

An agent is, at heart, a loop around an LLM that can call tools. Here it is with one tool and a budget.

# A minimal ReAct agent: one tool, a hard step budget, no framework magic
# Assumed to exist elsewhere: crm (internal client) and llm_decide (the model call)
def get_policy_status(policy_id):                 # 1. the only tool we expose
    return crm.lookup(policy_id)                    #    calls an internal system

tools = {"get_policy_status": get_policy_status}
goal = "Is policy A-1024 active and when does it renew?"
scratch, MAX = [], 5                              # memory + stopping condition

for step in range(MAX):                          # the agent loop
    decision = llm_decide(goal, scratch, tools)    # 2. THINK: tool call or final answer?
    if decision.is_final:
        print(decision.answer)
        break                                      #    done -> stop
    result = tools[decision.tool](**decision.args) # 3. ACT: run the chosen tool
    scratch.append((decision.tool, result))        # 4. OBSERVE: remember, then loop
else:                                             # runs only if the loop never hit break
    print("Step budget exhausted; escalating to a human.")   # guardrail
Everything else is hardening this loop: more (and safer) tools, a planner that splits the goal across worker agents, a critic that checks the answer, and human approval on state-changing actions — the design in the next section.
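The loop above leaves crm and llm_decide undefined. To make it runnable end to end, the sketch below swaps both for stand-ins: FAKE_CRM, the Decision dataclass and the scripted llm_decide are hypothetical names, and the policy record is made-up sample data. It shows only the control flow, not how a real model decides.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """What the model returns each turn: a tool call or a final answer."""
    is_final: bool
    answer: str = ""
    tool: str = ""
    args: dict = field(default_factory=dict)


# Hypothetical stand-in for the internal CRM; the record is made-up sample data.
FAKE_CRM = {"A-1024": {"status": "active", "renews": "2026-01-01"}}


def get_policy_status(policy_id: str) -> dict:
    return FAKE_CRM.get(policy_id, {"status": "unknown", "renews": "unknown"})


def llm_decide(goal: str, scratch: list, tools: dict) -> Decision:
    """Scripted stand-in for the model: call the tool once, then answer."""
    if not scratch:
        return Decision(is_final=False, tool="get_policy_status",
                        args={"policy_id": "A-1024"})
    _, result = scratch[-1]
    return Decision(
        is_final=True,
        answer=f"Policy A-1024 is {result['status']}; renews {result['renews']}.",
    )


def run_agent(goal: str, tools: dict, max_steps: int = 5):
    """Think, act, observe until a final answer or the step budget runs out."""
    scratch = []
    for _ in range(max_steps):
        decision = llm_decide(goal, scratch, tools)      # THINK
        if decision.is_final:
            return decision.answer, scratch
        result = tools[decision.tool](**decision.args)   # ACT
        scratch.append((decision.tool, result))          # OBSERVE
    return None, scratch                                 # budget exhausted


GOAL = "Is policy A-1024 active and when does it renew?"
TOOLS = {"get_policy_status": get_policy_status}

answer, trace = run_agent(GOAL, TOOLS)
print(answer or "Step budget exhausted; escalating to a human.")
```

Returning None on an exhausted budget, rather than printing inside the loop, lets the caller decide how to escalate.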

04 A planner / executor, multi-agent design

A supervisor (planner) decomposes the goal and delegates to specialised agents, each with its own tools; a critic verifies before anything is finalised.

[Figure: architecture diagram. Nodes: User goal (natural language); Supervisor / Planner (decompose & delegate); Research agent (retrieval / search); Action agent (calls internal APIs); Critic agent (verify & score); Result + trace. Tools: Vector DB, Web / search, Internal API, MCP servers. Edge labels: "revise if rejected", "on approval".]
Supervisor delegates to specialised agents; cyan = tool/MCP calls; the critic gates the result.
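The supervisor, worker and critic roles described above can be sketched in plain Python. Everything here is an assumption made for illustration: plan, the two worker functions and the toy critic are hypothetical stand-ins for what would be LLM-backed agents, and the revision limit is arbitrary.

```python
from typing import Callable


def research_agent(task: str) -> str:
    return f"[research] notes for: {task}"


def action_agent(task: str) -> str:
    return f"[action] done: {task}"


# Each worker is addressed by name, so the planner only emits (worker, task) pairs.
WORKERS: dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "action": action_agent,
}


def plan(goal: str) -> list[tuple[str, str]]:
    """Hypothetical planner: a real one would ask an LLM for this list."""
    return [
        ("research", f"gather facts for '{goal}'"),
        ("action", f"apply outcome for '{goal}'"),
    ]


def critic(results: list[str]) -> tuple[bool, str]:
    """Toy critic: approve only if every sub-task produced non-empty output."""
    empty = [i for i, r in enumerate(results) if not r.strip()]
    return (not empty, "ok" if not empty else f"empty results at {empty}")


def supervise(goal: str, max_revisions: int = 2) -> dict:
    """Plan, delegate, then let the critic gate the result; retry if rejected."""
    trace = []
    for attempt in range(1 + max_revisions):
        results = []
        for worker, task in plan(goal):
            results.append(WORKERS[worker](task))
            trace.append((attempt, worker, task))
        approved, reason = critic(results)
        if approved:
            break
    return {"approved": approved, "reason": reason,
            "results": results, "trace": trace}


outcome = supervise("renew policy A-1024")
print(outcome["approved"], len(outcome["trace"]))
```

The trace is returned alongside the result so that each delegation can be inspected later, which is what makes the design observable.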

05 Tools & MCP: how agents touch the real world

An agent is only as useful as the tools it can call. Each tool needs a clear name, description and typed schema so the model knows when and how to use it. The Model Context Protocol (MCP) standardises this: tools and data sources are exposed by MCP servers and any MCP-aware agent can discover and call them — instead of bespoke glue per integration.
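To make "clear name, description and typed schema" concrete, here is a minimal sketch of a tool descriptor as a plain dict. It mirrors the name / description / inputSchema shape that MCP tool listings use, but it does not use an MCP SDK, and validate_args is a hypothetical helper that covers only a tiny subset of JSON Schema.

```python
# A tool descriptor: what the model reads to decide when and how to call the tool.
TOOL_SPEC = {
    "name": "get_policy_status",
    "description": "Return the current status and key dates for an insurance policy.",
    "inputSchema": {
        "type": "object",
        "properties": {"policy_id": {"type": "string"}},
        "required": ["policy_id"],
    },
}

# Deliberately small mapping from JSON Schema type names to Python types.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(spec: dict, args: dict) -> list[str]:
    """Return a list of problems with the model's proposed arguments (empty = ok)."""
    schema = spec["inputSchema"]
    errors = [
        f"missing required argument: {name}"
        for name in schema.get("required", [])
        if name not in args
    ]
    for name, value in args.items():
        prop = schema["properties"].get(name)
        if prop is None:
            errors.append(f"unexpected argument: {name}")
        elif not isinstance(value, JSON_TYPES[prop["type"]]):
            errors.append(f"{name} should be of type {prop['type']}")
    return errors


print(validate_args(TOOL_SPEC, {"policy_id": "A-1024"}))
print(validate_args(TOOL_SPEC, {"id": 7}))
```

Validating arguments before the call means a malformed request can be fed back to the model as an observation instead of reaching the internal system.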

# A tool is just a typed, well-described function the agent can choose to call
from langchain_core.tools import tool

# Assumed to exist elsewhere: crm, llm, POLICY, search_documents, calculator and a
# create_agent helper; its exact signature and input format depend on the framework.

@tool
def get_policy_status(policy_id: str) -> dict:
    """Return the current status and key dates for an insurance policy.
    Use when the user asks about a specific policy by its ID."""   # description guides tool choice
    return crm.lookup(policy_id)                                      # calls an internal system

tools = [get_policy_status, search_documents, calculator]
agent = create_agent(llm, tools, system=POLICY)   # model decides which tool, when, with what args
result = agent.invoke({"goal": "Is policy A-1024 active and when does it renew?"})
Guardrail first. Tools that change state (write/act) get validation, allow-lists and — for high-stakes actions — a human approval step, reusing the human-in-the-loop pattern from my LangGraph study.
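One way to picture that guardrail is a single choke point that every tool call passes through. This is a sketch under stated assumptions: cancel_policy, ToolBlocked, guarded_call and the approve callback are hypothetical names, and a real approval step would pause the run and wait for a person rather than call a function.

```python
from typing import Callable


class ToolBlocked(Exception):
    """Raised when a call is refused by the allow-list or the approver."""


READ_ONLY = {"get_policy_status"}
STATE_CHANGING = {"cancel_policy"}        # hypothetical write tool

# Stub tools so the sketch runs on its own.
TOOLS = {
    "get_policy_status": lambda policy_id: {"policy_id": policy_id, "status": "active"},
    "cancel_policy": lambda policy_id: {"policy_id": policy_id, "status": "cancelled"},
}


def guarded_call(name: str, args: dict, tools: dict,
                 approve: Callable[[str, dict], bool]):
    """Run a tool only if it is allow-listed and, for writes, human-approved."""
    if name not in READ_ONLY | STATE_CHANGING:
        raise ToolBlocked(f"{name} is not on the allow-list")
    if name in STATE_CHANGING and not approve(name, args):
        raise ToolBlocked(f"{name} was not approved by a human")
    return tools[name](**args)


def deny_all(name: str, args: dict) -> bool:
    return False


# Reads pass without approval; writes need an explicit yes.
print(guarded_call("get_policy_status", {"policy_id": "A-1024"}, TOOLS, deny_all))
try:
    guarded_call("cancel_policy", {"policy_id": "A-1024"}, TOOLS, deny_all)
except ToolBlocked as exc:
    print("blocked:", exc)
```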

06 Design trade-offs

| Decision | Why it matters |
| --- | --- |
| Single agent vs. multi-agent | One agent is simpler; specialised agents give cleaner prompts, focused tools and easier evaluation, at the cost of coordination overhead. |
| Autonomy vs. control | More freedom solves more tasks but is harder to predict. I bound steps, budgets and tool scope, and keep a fixed graph where the steps are actually known. |
| Tool granularity | Many narrow tools are easier for the model to use correctly than a few overloaded ones, but too many bloat the context. |
| Memory strategy | Persisting everything is costly and leaks irrelevant context; I summarise and retrieve memory on demand. |
| Stopping conditions | Without explicit success criteria and a step/token budget, agents loop. Define "done" up front. |
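The memory strategy of summarising and retrieving on demand can be illustrated with a toy store. The assumptions are loud ones: summarise is a truncating stand-in for an LLM summariser, recall ranks by word overlap rather than embeddings, and the Memory class is hypothetical.

```python
import re


def summarise(text: str, max_words: int = 12) -> str:
    """Stand-in for an LLM summariser: keep only the first few words."""
    return " ".join(text.split()[:max_words])


def tokens(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Memory:
    """Stores compact notes and returns only those relevant to the query."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def remember(self, text: str) -> None:
        self.notes.append(summarise(text))

    def recall(self, query: str, k: int = 2) -> list[str]:
        q = tokens(query)
        scored = [(len(q & tokens(note)), note) for note in self.notes]
        relevant = [pair for pair in scored if pair[0] > 0]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in relevant[:k]]


memory = Memory()
memory.remember("Policy A-1024 renews every January")
memory.remember("User prefers email contact")
print(memory.recall("When does policy A-1024 renew?"))
```

Only the notes that overlap with the query enter the prompt, which is the point of the trade-off: unrelated history stays out of context.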

07 Evaluating & operating agents

Agents are non-deterministic, so evaluation is not optional — it is the control system.

Task success

Did it reach the goal? Outcome-based scoring on a fixed suite of representative tasks.

Trajectory quality

Were the right tools called in a sensible order, without wasteful loops or unsafe actions?

Cost & safety

Steps, tokens and latency per task, plus guardrail hits, so regressions are caught before users see them.
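The three dimensions above can be computed from logged runs. The sketch below assumes a hypothetical Run record and a hypothetical set of unsafe tools; the sample runs are invented for illustration and are not results. Immediate repeats of the same tool are used as a cheap proxy for wasteful loops.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One logged agent run on a task from the fixed evaluation suite."""
    task_id: str
    succeeded: bool
    tool_calls: list      # tool names, in call order
    tokens: int


UNSAFE_TOOLS = {"cancel_policy"}   # hypothetical: tools that must not run unapproved


def repeated_calls(calls: list) -> int:
    """Count back-to-back repeats of the same tool."""
    return sum(1 for prev, cur in zip(calls, calls[1:]) if prev == cur)


def evaluate(runs: list) -> dict:
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,                   # task success
        "repeated_calls": sum(repeated_calls(r.tool_calls) for r in runs),    # trajectory
        "unsafe_runs": [r.task_id for r in runs if UNSAFE_TOOLS & set(r.tool_calls)],
        "avg_steps": sum(len(r.tool_calls) for r in runs) / n,                # cost
        "avg_tokens": sum(r.tokens for r in runs) / n,
    }


# Invented sample runs, for illustration only.
runs = [
    Run("t1", True, ["search_documents", "get_policy_status"], 900),
    Run("t2", False, ["get_policy_status", "get_policy_status", "cancel_policy"], 1500),
]
report = evaluate(runs)
print(report)
```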

LLM-as-judge, carefully. Automated graders scale evaluation, but I anchor them with a human-labelled gold set so the judge itself stays honest.
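Anchoring the judge to a gold set comes down to measuring how often it agrees with the human labels. A minimal sketch, assuming pass/fail labels and using Cohen's kappa to correct raw agreement for chance; the labels below are invented for illustration.

```python
from collections import Counter


def agreement(human: list, judge: list) -> tuple:
    """Return (raw agreement, Cohen's kappa) between human and judge labels."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Invented labels for four gold-set items.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
observed, kappa = agreement(human_labels, judge_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

A judge that agrees with the gold set only at chance level gets a kappa near zero even when raw agreement looks high, which is why both numbers are worth tracking.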

08 How the three studies fit together