AI Agents Development

Custom agents that read documents, hold conversations, take phone calls, and execute multi-step workflows — wired into the systems you already run.

What an AI agent actually is (and isn't)

An AI agent is a system that turns a goal into a sequence of tool calls. Give it an objective — "extract every line item from this PDF and post it to NetSuite" — and it plans the steps, picks the right tools (a vision model, a schema validator, an API call to your ERP), executes them, watches for failure modes, and either finishes the job or hands off to a human with context.

That is materially different from:

  • A chatbot (a single-turn or short conversational interface, usually without tool calls)
  • A ChatGPT integration (a thin wrapper around a hosted model)
  • A prompt template (one static prompt with variables substituted in)

The difference matters because agents are how AI starts replacing real work. Chatbots answer questions. Agents complete jobs.

If you want the deeper definition, our What is an AI agent guide walks through the anatomy with diagrams.
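
The goal-to-tool-calls loop described above can be sketched in a few lines. This is a minimal illustration, not our production code: the planner is a stand-in for the model, and every name here (runAgent, Planner, the demo tools) is hypothetical.

```typescript
// All names below are hypothetical; the planner stands in for the model.
type Tool = (input: string) => string;

type Step =
  | { kind: "call"; tool: string; input: string }
  | { kind: "done"; result: string };

// Given the goal and what has happened so far, decide the next step.
type Planner = (goal: string, history: string[]) => Step;

type Outcome =
  | { status: "finished"; result: string; history: string[] }
  | { status: "handed_off"; reason: string; history: string[] };

function runAgent(
  goal: string,
  plan: Planner,
  tools: Record<string, Tool>,
  maxSteps = 5,
): Outcome {
  const history: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = plan(goal, history);
    if (step.kind === "done") {
      return { status: "finished", result: step.result, history };
    }
    const tool = tools[step.tool];
    if (!tool) {
      return { status: "handed_off", reason: `unknown tool: ${step.tool}`, history };
    }
    try {
      history.push(`${step.tool} -> ${tool(step.input)}`);
    } catch (err) {
      // A failed tool call ends the run and passes the context to a human.
      return { status: "handed_off", reason: `${step.tool} failed: ${String(err)}`, history };
    }
  }
  return { status: "handed_off", reason: "step budget exhausted", history };
}

// Demo with a scripted planner: extract first, then try to post (which fails).
const demoTools: Record<string, Tool> = {
  extract: (pdf) => `3 line items from ${pdf}`,
  post: () => {
    throw new Error("ERP timeout");
  },
};

const scriptedPlanner: Planner = (_goal, history) =>
  history.length === 0
    ? { kind: "call", tool: "extract", input: "invoice.pdf" }
    : { kind: "call", tool: "post", input: history.join("; ") };

const demoOutcome = runAgent("post invoice line items", scriptedPlanner, demoTools);
```

In the demo the second tool call throws, so the run ends as a hand-off that carries both the failure reason and the history of what already succeeded. A chatbot or prompt template has no equivalent of that loop.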

When you actually need an agent

You need an agent, rather than plain automation, a custom build, or a SaaS tool, when all of these are true:

  1. The work involves unstructured inputs — PDFs, emails, voice, free-form chat, scanned forms.
  2. Each instance requires judgement — not a fixed rule (if X then Y), but a decision based on context.
  3. The volume is non-trivial — at least dozens of items per day, ideally hundreds or thousands.
  4. Errors are visible — you can detect when the agent gets something wrong, either via downstream signal (a returned invoice, a complaint) or via a human review queue.

If your workflow is deterministic (always the same rules, structured input), don't build an agent — build automation. Our AI agents vs automation deep-dive walks through the decision in detail.

How we build agents — the six-phase loop

Every agent we ship goes through the same six phases. The structure exists because the failure modes are predictable: skip discovery and you build the wrong thing; skip evals and you can't tell when it regresses; skip observability and you can't fix it in production.

1. Discover — one to two weeks

We sit with the people doing the actual work. We map inputs, outputs, and the messy bits the documentation always misses. We identify the success metric in finance-grade language — hours saved, errors avoided, cycle time cut — not in vanity AI metrics like "accuracy." We write the spec.

Deliverables: a workflow map, a ranked opportunity list, success criteria, a draft agent specification, and a no-go decision template.

2. Design — three to five days

We pick the right shape: agent vs. automation vs. SaaS vs. custom build. We pick the LLM provider (Claude / OpenAI / Gemini / self-hosted) based on the task. We pick the framework (LangGraph for complex multi-step flows, plain SDK for simple jobs, Anthropic's computer use tool for browser tasks). We draw the architecture diagram and write the cost/timeline estimate.

Deliverables: architecture diagram, risk register, cost & timeline estimate, signed spec.

3. Prototype — two to three weeks

We build a working slice on real data, real failure modes, real numbers. Not slideware. The prototype is end-to-end — it ingests the actual inputs, calls the actual tools, produces the actual outputs — but with thin scaffolding around the parts that need production hardening. We run it side by side with the human process for a week.

Deliverables: working prototype, evaluation results on real data, go/no-go decision.

4. Build — three to eight weeks

Production engineering. Authentication. Logging structured for replay. Idempotency keys on every action. Retry policies. Rate limit handling. Cost guardrails. Tests, types, and code review. CI/CD pipeline configured for your team. Eval suite running on every PR. Observability dashboard.

Deliverables: production code, CI/CD, test suite, eval dashboard, runbooks.
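
To make "idempotency keys on every action" and "retry policies" concrete, here is a minimal sketch. It assumes an in-memory map where a production build would use a durable store, and the names (runOnce, completedActions) and the backoff numbers are illustrative, not taken from a specific engagement.

```typescript
type AgentAction<T> = () => Promise<T>;

// Assumption: an in-memory stand-in for a durable store of completed actions.
const completedActions = new Map<string, unknown>();

async function runOnce<T>(
  idempotencyKey: string,
  action: AgentAction<T>,
  opts: { maxAttempts: number; baseDelayMs: number } = { maxAttempts: 3, baseDelayMs: 200 },
): Promise<T> {
  // If this key already succeeded, return the stored result instead of acting again.
  if (completedActions.has(idempotencyKey)) {
    return completedActions.get(idempotencyKey) as T;
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      const result = await action();
      completedActions.set(idempotencyKey, result);
      return result;
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxAttempts) {
        // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
        const delayMs = opts.baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

The sketch only deduplicates calls that have already succeeded in this process. For actions that touch an external system, the same key would typically also be forwarded to that system (where its API supports one), so that a retry after a timeout cannot post twice.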

5. Deploy — one week

We roll out in waves: 10% of traffic, 50%, 100%. We watch the dashboards. We compare outputs to the human baseline. We tune. We document what we learn.

Deliverables: phased rollout, on-call rota, monitoring dashboards, post-launch summary.
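
One way to implement the 10% / 50% / 100% waves is deterministic bucketing: each work item hashes to a stable bucket from 0 to 99, and the agent handles every item whose bucket falls below the current percentage. The sketch below is an assumption about mechanism, not a description of a specific deployment; the function names are hypothetical and the hash is an FNV-1a-style hash over UTF-16 code units.

```typescript
// Map an item id to a stable bucket in the range 0..99.
function bucketOf(id: string): number {
  let hash = 0x811c9dc5; // FNV offset basis (32-bit)
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % 100;
}

// True when the item belongs to the current rollout wave.
function routedToAgent(id: string, rolloutPercent: number): boolean {
  return bucketOf(id) < rolloutPercent;
}
```

Because the bucket depends only on the id, an item that the agent handled at 10% is still handled by the agent at 50% and 100%, which keeps comparisons against the human baseline stable from wave to wave.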

6. Iterate — ongoing

Agents drift. Models upgrade. Schemas change. We keep a retainer slot for monthly eval runs, prompt tuning, and the inevitable "while you're at it" features.

Deliverables: monthly evals, prompt versioning, quarterly roadmap.

The tech we use

We optimise for shipping, not for résumé-driven engineering. Our typical stack:

Layer | Default choice | Why
Reasoning model | Claude Sonnet 4.6 / 4.7 | Best reliability for tool calling and long-context reasoning
Vision model | Claude or Gemini 2.0 Flash | Strong at structured extraction from PDFs and images
Voice model | GPT-4o Realtime / Whisper | Sub-second response, multilingual, robust to accents
Orchestration | LangGraph + plain SDK | LangGraph for multi-step agents with branches; plain SDK when one tool call is enough
Retrieval | pgvector / Pinecone / Vectorize | pgvector if you already have Postgres; managed for everything else
Observability | Langfuse / Helicone / OpenTelemetry | Per-call traces, cost attribution, regression detection
Eval framework | Custom + Promptfoo / Braintrust | Promptfoo for offline, Braintrust for production traces
Deploy target | Firebase / Vercel / your cloud | Firebase App Hosting is our default; we go wherever your data lives
Tooling glue | TypeScript everywhere | One language end-to-end keeps the team small

We are provider-agnostic by design. Every agent is abstracted behind a thin interface so you can swap models without rewriting the agent.
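
A thin interface of this kind might look like the sketch below. The interface name, its single method, and the fallback helper are all hypothetical; real adapters would wrap each vendor's SDK behind the same shape.

```typescript
// Hypothetical provider-neutral interface; each vendor gets its own adapter.
interface ModelProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Try providers in order and return the first successful completion.
async function completeWithFallback(
  providers: ModelProvider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      return { provider: provider.name, text: await provider.complete(prompt) };
    } catch (err) {
      errors.push(`${provider.name}: ${String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

The agent code only ever sees ModelProvider, so swapping or reordering models is a change to the provider list rather than to the agent logic.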

Pricing and timeline — the honest version

Generic agency answers like "it depends" waste your time. Here are real ranges from real engagements.

Engagement | Duration | Investment
Discovery sprint (workflow map, spec, go/no-go) | 1–2 weeks | €4,000–6,000
Working prototype on real data | 2–3 weeks | €8,000–15,000
Production agent (single-purpose) | 6–10 weeks | €25,000–50,000
Production agent (multi-purpose / multi-channel) | 10–16 weeks | €50,000–100,000
Ongoing retainer (evals, tuning, on-call) | Monthly | from €2,000/month

We always quote a firm price before work begins. If discovery shows that an agent is the wrong shape, we tell you and refund the rest of the engagement. That has happened twice in our history, and we'd do it again.

What an actual agent looks like in production

Two short examples from real builds.

Document intake — accounts payable

A mid-market distributor was keying ~400 supplier invoices per week into NetSuite. The team wanted "AI." What we shipped:

  1. Ingestion — Resend webhook captures invoice emails into a Firestore queue.
  2. Vision extraction — Claude vision pass with a Zod schema for line items, totals, tax codes.
  3. PO matching — agent calls NetSuite to find candidate POs, applies tolerance rules (€5 or 1% mismatch is OK).
  4. Confidence routing — high confidence + matched PO → auto-post. Medium → review queue. Low → reject with structured reason.
  5. Posting — agent posts to NetSuite, attaches the original PDF, marks the email as processed.

Outcome: 87% auto-post rate after tuning, 60% reduction in AP time spent on invoice keying. Read the full breakdown in the document intake case study.
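
Steps 3 and 4 above can be sketched as two small functions. Several details are assumptions made for illustration only: the tolerance is read as "within €5 or within 1% of the PO total, whichever passes", the confidence thresholds (0.9 and 0.6) are invented, and a high-confidence invoice with no matching PO is sent to review.

```typescript
type InvoiceRoute = "auto_post" | "review" | "reject";

// Assumption: a mismatch passes if it is within the absolute OR the relative tolerance.
function withinTolerance(
  invoiceTotal: number,
  poTotal: number,
  absEur = 5,
  pct = 0.01,
): boolean {
  const diff = Math.abs(invoiceTotal - poTotal);
  return diff <= absEur || diff <= Math.abs(poTotal) * pct;
}

// Assumption: the thresholds are illustrative and would be tuned on real data.
function routeInvoice(
  confidence: number,
  poMatched: boolean,
  high = 0.9,
  medium = 0.6,
): { route: InvoiceRoute; reason?: string } {
  if (confidence < medium) {
    return { route: "reject", reason: "low_extraction_confidence" };
  }
  if (confidence >= high && poMatched) {
    return { route: "auto_post" };
  }
  return { route: "review", reason: poMatched ? "medium_confidence" : "no_po_match" };
}
```

For example, under these assumptions an invoice of €1,012 against a €1,000 PO fails the match (€12 is more than both €5 and 1%), so even a high-confidence extraction lands in the review queue rather than being auto-posted.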

Voice concierge — service business

A boutique clinic was losing after-hours bookings to voicemail. What we shipped:

  1. Twilio number routes to a GPT-4o Realtime model.
  2. The agent has function-calling tools: findAvailability(date, durationMinutes), bookAppointment(slotId, patientInfo), transferToHuman(reason).
  3. The agent books the slot, writes the lead into HubSpot, and texts a confirmation.
  4. Recording and full transcript are stored in Firestore for every call.

Outcome: ~80% of after-hours calls now convert to bookings (previously near-zero). Read Voice Concierge.
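
The three tools named in step 2 need a dispatcher that validates whatever arguments the model sends before anything touches the calendar. The sketch below assumes a generic call shape (a tool name plus JSON-encoded arguments) and a hypothetical Backend interface; it does not reproduce any vendor's exact event format, and the patientInfo fields are invented for the example.

```typescript
// Assumed shape of a model-issued function call: tool name plus JSON-encoded arguments.
type ToolCall = { name: string; arguments: string };

// Hypothetical backend; a real build would wrap the booking system and telephony.
interface Backend {
  findAvailability(date: string, durationMinutes: number): string[];
  bookAppointment(slotId: string, patientInfo: { name: string; phone: string }): string;
  transferToHuman(reason: string): string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function dispatch(call: ToolCall, backend: Backend): unknown {
  let args: unknown;
  try {
    args = JSON.parse(call.arguments);
  } catch {
    return backend.transferToHuman("malformed tool arguments");
  }
  if (!isRecord(args)) return backend.transferToHuman("malformed tool arguments");

  if (call.name === "findAvailability") {
    const date = args["date"];
    const duration = args["durationMinutes"];
    if (typeof date === "string" && typeof duration === "number") {
      return backend.findAvailability(date, duration);
    }
  } else if (call.name === "bookAppointment") {
    const slotId = args["slotId"];
    const info = args["patientInfo"];
    if (typeof slotId === "string" && isRecord(info)) {
      const patientName = info["name"];
      const phone = info["phone"];
      if (typeof patientName === "string" && typeof phone === "string") {
        return backend.bookAppointment(slotId, { name: patientName, phone });
      }
    }
  } else if (call.name === "transferToHuman") {
    const reason = args["reason"];
    return backend.transferToHuman(typeof reason === "string" ? reason : "unspecified");
  }
  // Unknown tool or invalid arguments: fail safe by handing the caller to a person.
  return backend.transferToHuman(`invalid call to ${call.name}`);
}
```

The design choice worth noting is the default branch: anything the dispatcher cannot validate becomes a transfer to a human rather than a silent error, so a confused model never leaves a caller stuck.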

What we will not do

A short list of things we have learned, painfully, to refuse.

  • Build an agent without evals. It is like shipping a car without seatbelts.
  • Promise specific accuracy numbers before the prototype. Anyone who quotes "99% accurate" pre-prototype is bluffing.
  • Use a single closed-source model with no fallback. Every agent has at least one provider escape hatch.
  • Skip the human-in-the-loop on irreversible actions. Money moves, contracts sign, emails send. All gated.
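
The last rule, gating irreversible actions, can be sketched as a small approval gate. This is an illustrative outline with hypothetical names; in a real build the pending queue would be durable and the irreversible flag would be set per tool by the engineers, never by the model.

```typescript
// Hypothetical action shape; "irreversible" is configured per tool, not chosen by the model.
type ProposedAction = { id: string; description: string; irreversible: boolean };
type SubmitResult = "executed" | "pending_approval";

function createApprovalGate(execute: (action: ProposedAction) => void) {
  const pending = new Map<string, ProposedAction>();
  return {
    // Reversible actions run immediately; irreversible ones wait for a person.
    submit(action: ProposedAction): SubmitResult {
      if (action.irreversible) {
        pending.set(action.id, action);
        return "pending_approval";
      }
      execute(action);
      return "executed";
    },
    // A reviewer approves a pending action; returns false if the id is unknown.
    approve(id: string): boolean {
      const action = pending.get(id);
      if (!action) return false;
      pending.delete(id);
      execute(action);
      return true;
    },
    // A reviewer rejects a pending action; nothing is executed.
    reject(id: string): boolean {
      return pending.delete(id);
    },
    pendingCount(): number {
      return pending.size;
    },
  };
}
```

Usage is a single call per proposed action: a draft reply passes straight through, while a payment or an outbound email sits in the queue until a reviewer calls approve or reject.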

Frequently asked questions

See the FAQ section below — or jump to the agent type taxonomy if you want to see the six concrete shapes we build, with examples and costs per shape.

If you have a workflow in mind and want a fast take on whether an agent is the right shape, send a short note. We reply within one business day.

Ready to scope AI agents development?

A discovery call is the fastest way to know if there's a fit.