How AI agents actually work (under the hood)
The reasoning loop
Strip away the libraries and frameworks, and an AI agent is one tight loop:
```python
while not done:
    plan = model.think(goal, context, history)
    action = plan.next_action  # call a tool, ask a question, finish
    if action == "finish":
        break
    result = execute(action)  # actually run the tool
    history.append((action, result))
    context = update_context(history)
```
That's it. The model thinks, picks an action, the system executes it, the result feeds back into the next iteration. Everything else — function calling APIs, retrieval, guardrails, observability — is plumbing around this loop.
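To make the loop concrete, here is a minimal runnable sketch. It is an illustration under stated assumptions, not any SDK's API: `fake_model` is a scripted stand-in for the LLM call, and the `TOOLS` registry and `lookup_order` tool are hypothetical names invented for the example.

```python
from typing import Any, Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[..., Any]] = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}


def fake_model(goal: str, history: list[tuple[dict, Any]]) -> dict:
    """Stand-in for the LLM: look the order up once, then finish."""
    if not history:
        return {"type": "tool", "name": "lookup_order", "args": {"order_id": "A-17"}}
    _, last_result = history[-1]
    return {"type": "finish", "answer": f"Order is {last_result['status']}"}


def run_agent(goal: str, max_steps: int = 10) -> str:
    history: list[tuple[dict, Any]] = []
    for _ in range(max_steps):              # bounded, unlike a bare while loop
        action = fake_model(goal, history)  # plan: pick the next action
        if action["type"] == "finish":
            return action["answer"]
        result = TOOLS[action["name"]](**action["args"])  # execute the tool
        history.append((action, result))    # feed the result into the next turn
    raise RuntimeError("step limit reached without finishing")


print(run_agent("Where is order A-17?"))
```

Swapping `fake_model` for a real model call changes nothing about the shape of the loop, which is the point of the section above.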
Layer 1: the reasoning model
The LLM does the planning step. In 2026, the production-grade options are:
| Model | Best at | Caveat |
|---|---|---|
| Claude Sonnet 4.6 / 4.7 | Reliable tool calling, long-context reasoning, structured outputs | Most expensive |
| GPT-4o / GPT-4.1 | General-purpose, voice agents, large ecosystem | Slightly weaker on long-context tasks |
| Gemini 2.5 Pro / 2.0 Flash | Long context (1M+ tokens), vision, cost-sensitive workloads | Tool-calling stability less mature |
| Llama 3.3 70B (self-hosted) | Privacy-sensitive deployments | Engineering overhead to host |
For most production agents, Claude Sonnet is our default: in our experience, its tool-calling reliability and structured-output behavior are the strongest in the field as of early 2026.
For deeper model comparison see our ChatGPT vs Claude vs Gemini post.
Layer 2: tool calling
Tools are what turn the LLM from a text generator into something that acts on the world.
In the model SDK, tools are typed:
```typescript
const findAvailabilityTool = {
  name: "findAvailability",
  description: "Find open appointment slots on a given date",
  input_schema: {
    type: "object",
    properties: {
      date: { type: "string", format: "date" },
      durationMinutes: { type: "number" },
    },
    required: ["date", "durationMinutes"],
  },
};
```
When the model decides to call this tool, it emits structured JSON:
```json
{
  "tool_name": "findAvailability",
  "input": { "date": "2026-05-26", "durationMinutes": 30 }
}
```
Your runtime validates the JSON against the schema, calls the actual function, returns the result to the model, and the model continues planning.
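The validate-then-dispatch step can be sketched in a few lines. This is a hand-rolled check covering only the schema features used above, not a full JSON Schema validator; a production runtime would more likely use a validation library. Two assumptions to note: `find_availability` is a hypothetical implementation that returns canned slots, and the sketch rejects unknown fields, which is stricter than JSON Schema's default.

```python
import datetime
import json

SCHEMA = {
    "type": "object",
    "properties": {
        "date": {"type": "string", "format": "date"},
        "durationMinutes": {"type": "number"},
    },
    "required": ["date", "durationMinutes"],
}


def validate(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the input is valid."""
    errors = []
    props = schema["properties"]
    for name in schema["required"]:
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown field: {name}")  # stricter than the JSON Schema default
            continue
        spec = props[name]
        if spec["type"] == "string":
            if not isinstance(value, str):
                errors.append(f"{name} must be a string")
            elif spec.get("format") == "date":
                try:
                    datetime.date.fromisoformat(value)
                except ValueError:
                    errors.append(f"{name} must be an ISO date (YYYY-MM-DD)")
        elif spec["type"] == "number":
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{name} must be a number")
    return errors


def find_availability(date: str, durationMinutes: int) -> list[str]:
    """Hypothetical tool body: returns canned slots instead of querying a calendar."""
    return [f"{date}T09:00", f"{date}T14:30"]


def handle_tool_call(raw: str) -> dict:
    call = json.loads(raw)
    errors = validate(call["input"], SCHEMA)
    if errors:
        return {"ok": False, "errors": errors}  # goes back to the model, not to the tool
    return {"ok": True, "result": find_availability(**call["input"])}
```

Returning the error list to the model, rather than raising, gives it a chance to correct the call on the next turn.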
Three things make tools work in production:
- Typed schemas: the runtime rejects fields the model invents, and invalid JSON fails at the boundary instead of reaching your systems.
- Idempotency keys: calling the same tool twice with the same args is safe (no duplicate bookings, payments, etc.).
- Failure semantics: a failing tool returns a clear error so the model can recover (retry, ask a clarifying question, or escalate to a human).
Skip any of these and the agent breaks in production within days.
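The idempotency and failure-semantics points can be illustrated with a small sketch. Everything here is hypothetical: `book_slot` stands in for a real side effect, the in-memory `_results` dict stands in for a durable store, and deriving the key from the tool name plus arguments is just one possible scheme (many systems key on the run and step instead).

```python
import hashlib
import json

_results: dict[str, dict] = {}  # idempotency key -> stored result
bookings: list[dict] = []       # stands in for the real side effect


def idempotency_key(tool: str, args: dict) -> str:
    """Derive a stable key from the tool name and canonicalized arguments."""
    canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def book_slot(args: dict) -> dict:
    if args["durationMinutes"] <= 0:
        raise ValueError("durationMinutes must be positive")
    bookings.append(args)
    return {"bookingId": f"B-{len(bookings)}"}


def call_tool(tool: str, args: dict) -> dict:
    key = idempotency_key(tool, args)
    if key in _results:
        return _results[key]  # replay the stored result; no second booking
    try:
        result = {"ok": True, "data": book_slot(args)}
    except ValueError as exc:
        # Structured error the model can act on. Failures are not cached,
        # so a corrected call can still go through.
        return {"ok": False, "error": str(exc), "retryable": False}
    _results[key] = result
    return result
```

Calling `call_tool` twice with the same arguments, in any key order, returns the same booking and leaves `bookings` with a single entry.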
Layer 3: retrieval
The LLM doesn't know your data. Retrieval is how it finds out.
Three retrieval patterns:
Inline — the model fetches what it needs
model: I need to know which POs exist for vendor X from last month.
agent runtime: [calls searchPurchaseOrders({vendor: 'X', from: '2026-04-01'})]
runtime returns: [{id: 'PO-1234', total: 1500}, ...]
model: Continues planning with PO data in context.
Best for: when the right query depends on the goal.
Pre-fetched — the system fills context up front
agent runtime: Embeds the user goal, retrieves top-10 relevant chunks, builds the system prompt with those chunks included.
model: Reasons with everything already in context.
Best for: knowledge-base Q&A where the relevant content is predictable from the query.
Hybrid
Most production agents combine both: pre-fetch the obvious context, let the model inline-call retrieval for anything else.
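A toy sketch of the hybrid pattern, under loud assumptions: word overlap stands in for embedding similarity, the `DOCS` and `PURCHASE_ORDERS` lists stand in for real stores, and `search_purchase_orders` is a hypothetical tool the model could call mid-run.

```python
DOCS = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping policy: orders ship within 2 business days.",
    "Support hours: weekdays 9am to 5pm.",
]

PURCHASE_ORDERS = [
    {"id": "PO-1234", "vendor": "X", "total": 1500},
    {"id": "PO-1300", "vendor": "Y", "total": 220},
]


def tokens(text: str) -> set[str]:
    return {word.strip(".,:?").lower() for word in text.split()}


def prefetch(goal: str, k: int = 2) -> list[str]:
    """Pre-fetched pattern: rank chunks against the goal before the model runs.

    Word overlap is a toy stand-in for embedding similarity.
    """
    goal_words = tokens(goal)
    ranked = sorted(DOCS, key=lambda doc: len(goal_words & tokens(doc)), reverse=True)
    return ranked[:k]


def search_purchase_orders(vendor: str) -> list[dict]:
    """Inline pattern: a retrieval tool the model calls when it decides it needs data."""
    return [po for po in PURCHASE_ORDERS if po["vendor"] == vendor]


def build_system_prompt(goal: str) -> str:
    context = "\n".join(f"- {chunk}" for chunk in prefetch(goal))
    return (
        f"Goal: {goal}\n"
        f"Known context:\n{context}\n"
        "Tools: search_purchase_orders(vendor)"
    )


print(build_system_prompt("What is the refund policy?"))
```

The prompt carries the predictable context up front, while anything goal-dependent (here, purchase orders) stays behind a tool call.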
For production RAG patterns see our RAG patterns post.
Layer 4: guardrails
Without guardrails, an agent can do things you very much don't want it to.
Production guardrails:
- Approval gates on irreversible actions. The agent stops in a "pending approval" state until a designated human approves via Slack / dashboard / email.
- Spend caps per agent run. Hard stop if LLM token cost or downstream API cost exceeds threshold.
- Step ceilings. Maximum N tool calls per agent run. Beyond N, the agent gives up cleanly.
- Scope refusals. System prompt defines what's in scope; user inputs that try to push outside scope get politely refused.
- Prompt-injection detection. User-supplied content (emails, document contents) gets sanitized; "ignore previous instructions" patterns are detected and refused.
We design every agent assuming it will be attacked. Sometimes that's adversarial; usually it's accidental (a user being weird, a document containing unexpected content). Guardrails prevent the agent from doing something it shouldn't.
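Three of these guardrails (step ceilings, spend caps, and approval gates) reduce to a check that runs before every tool call. The sketch below is one possible shape, with assumed names throughout: `RunBudget`, `GuardrailStop`, and the `IRREVERSIBLE` tool set are hypothetical, and the thresholds are placeholders.

```python
from dataclasses import dataclass


class GuardrailStop(Exception):
    """Raised when a run must halt or wait for a human."""


IRREVERSIBLE = {"send_payment", "delete_record"}  # hypothetical tool names


@dataclass
class RunBudget:
    max_steps: int = 8
    max_cost_usd: float = 0.50
    steps: int = 0
    cost_usd: float = 0.0

    def check(self, tool: str, step_cost_usd: float, approved: bool = False) -> None:
        """Call before executing each tool; raises instead of letting the call run."""
        if self.steps + 1 > self.max_steps:
            raise GuardrailStop("step ceiling reached")
        if self.cost_usd + step_cost_usd > self.max_cost_usd:
            raise GuardrailStop("spend cap exceeded")
        if tool in IRREVERSIBLE and not approved:
            raise GuardrailStop(f"pending approval: {tool}")
        # Only count the step once every check has passed.
        self.steps += 1
        self.cost_usd += step_cost_usd
```

In this sketch the runtime would catch `GuardrailStop`, park the run in a pending state when the reason is an approval, and retry the same call with `approved=True` once a human signs off.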
Layer 5: evals
Evals are how you know the agent is good.
An eval suite is a set of representative cases — typical inputs and expected behaviors — that we run on every prompt or model change. Three flavors:
Snapshot evals
For each input, the expected output is fixed. The eval checks for exact or semantic match. Useful for: structured extraction, classification, deterministic outputs.
Behavior evals
For each input, the expected behavior is described. A grader (often another LLM, sometimes a human) checks whether the actual behavior matches. Useful for: judgment-heavy tasks where there are many valid answers.
Live evals
Sampled real-production traces, scored after the fact. Useful for catching drift you didn't anticipate.
We build evals during discovery. They run in CI. They surface scores on a dashboard. Without them, the agent silently degrades.
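A snapshot eval, the simplest of the three flavors, can be sketched as a list of cases plus an exact-match scorer. The `classify_ticket` function below is a hypothetical rule-based stand-in for the system under test; in a real suite it would call the agent. The third case is deliberately one the stand-in gets wrong, to show what a failure report looks like.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    input_text: str
    expected: str


def classify_ticket(text: str) -> str:
    """Hypothetical system under test; a real suite would call the agent here."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if "password" in lowered:
        return "account"
    return "other"


CASES = [
    EvalCase("I want a refund for my order", "billing"),
    EvalCase("I forgot my password", "account"),
    EvalCase("Do you ship to Canada?", "shipping"),  # known gap in the stand-in
]


def run_snapshot_evals(cases: list[EvalCase]) -> tuple[float, list[EvalCase]]:
    """Exact-match scoring: return the pass rate and the failing cases."""
    failures = [case for case in cases if classify_ticket(case.input_text) != case.expected]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


pass_rate, failures = run_snapshot_evals(CASES)
print(f"pass rate: {pass_rate:.2f}")
for case in failures:
    print(f"FAILED: {case.input_text!r} (expected {case.expected!r})")
```

A CI job could compare `pass_rate` against a stored baseline and fail the build when it drops, which is one way to catch a regression from a prompt or model change.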
Layer 6: observability
Per-trace logging. Every input, every tool call, every model decision, every output, every cost.
We use Langfuse as the default observability layer for AI-touched flows. It gives us:
- Per-trace timeline (what happened, in order, with timing).
- Cost attribution per trace.
- Filter / search by user, by outcome, by tool used.
- Replay any past trace.
- Eval scores attached to traces.
Plus structured logs to Sentry for errors and a dashboard for the operator (success rate, cost, latency, queue depth).
Without observability you can't debug, evaluate, or improve. The first day after deploy you'll need to know why something went wrong; observability is how.
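To show what per-trace logging captures, here is a standard-library sketch. It is not the Langfuse API; `Trace`, `Span`, and `record` are hypothetical names, and the cost figures are passed in by the caller rather than computed from token counts.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    kind: str  # "model" or "tool"
    name: str
    duration_ms: float
    cost_usd: float = 0.0


@dataclass
class Trace:
    user_id: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, kind: str, name: str, fn, cost_usd: float = 0.0):
        """Run fn, time it, and append a span in call order (even if fn raises)."""
        start = time.perf_counter()
        try:
            return fn()
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append(Span(kind, name, elapsed_ms, cost_usd))

    @property
    def total_cost_usd(self) -> float:
        return sum(span.cost_usd for span in self.spans)

    def to_json(self) -> str:
        """Serialize the whole trace as one structured log line."""
        return json.dumps({**asdict(self), "total_cost_usd": self.total_cost_usd})


trace = Trace(user_id="u-1")
trace.record("model", "plan", lambda: "call findAvailability", cost_usd=0.004)
trace.record("tool", "findAvailability", lambda: ["09:00"])
print(trace.to_json())
```

Each trace yields an ordered timeline with timing and cost per step, which is the raw material for the filtering, replay, and cost attribution described above.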
Putting it together
A production-grade agent has, at minimum:
- A defined goal interface (webhook, button, schedule).
- A reasoning model (Claude / GPT / Gemini), abstracted behind a provider interface.
- A set of typed tools with idempotency and failure semantics.
- Retrieval against your data.
- Guardrails: approval gates, spend caps, step ceilings, scope, prompt-injection detection.
- An eval suite that runs in CI.
- Per-trace observability with a dashboard.
Skip any of those and the agent works in the demo and fails in production. Honor them all and the agent is no more mysterious than a well-written background worker — just one that happens to use an LLM as its decision engine.
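The "abstracted behind a provider interface" item can be as small as one protocol. In this sketch, `ReasoningModel`, `ScriptedModel`, and `EchoModel` are hypothetical names; a real adapter would wrap a vendor SDK behind the same method.

```python
from typing import Protocol


class ReasoningModel(Protocol):
    """The only surface the agent loop depends on."""

    def next_action(self, goal: str, history: list[dict]) -> dict: ...


class ScriptedModel:
    """Test double that replays a fixed list of actions."""

    def __init__(self, actions: list[dict]) -> None:
        self._actions = list(actions)
        self._cursor = 0

    def next_action(self, goal: str, history: list[dict]) -> dict:
        action = self._actions[self._cursor]
        self._cursor += 1
        return action


class EchoModel:
    """Second implementation, to show that callers do not care which one they get."""

    def next_action(self, goal: str, history: list[dict]) -> dict:
        return {"type": "finish", "answer": f"done: {goal}"}


def plan_once(model: ReasoningModel, goal: str) -> dict:
    # Depends only on the protocol, so swapping providers needs no changes here.
    return model.next_action(goal, history=[])
```

The same seam that lets you swap vendors also lets evals run against a scripted model without network calls.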
Where it pairs
If you want to see how this architecture lands in concrete agent shapes, see our agent type pages — six concrete shapes with stacks, costs, and examples.
If you want to see it shipped, the Document Intake Agent case study walks through every layer in a real AP automation build.
If you have an agent in mind and want a feasibility take, drop us a note — one paragraph is enough.