What if an agency can't answer these questions clearly?

It's a signal, not a kill criterion. Some shops are great builders but bad at sales conversations. But if multiple answers are vague or evasive — especially about evals, observability, or code ownership — walk. Those are non-negotiables in 2026 production AI.

How many references should I ask for?

Three minimum, and pick the ones the agency didn't volunteer. The references they spontaneously offered are their best-case stories. The ones you have to dig for are closer to typical. Ask the references about what went wrong, not just what went right.

Is it OK if an agency mostly does one type of agent?

Often yes. Specialisation usually means deeper expertise. The risk: if the agency's only shape is X, they may try to fit your problem into X even when Y is the right answer. Ask explicitly: 'when do you tell clients an agent is the wrong solution?'

Hiring an AI development agency: 12 questions to ask

May 10, 2026 · updated May 21, 2026 · 5 min read

The 12 questions

Before you sign with any AI development agency in 2026, get clear answers to these. Vague or evasive answers are themselves a signal.

1. "What's in your eval suite, and can I see how scores trend?"

A serious agency runs evals on every prompt and model change. They should be able to show you:

A sample eval set from a previous engagement (anonymised).
The CI integration that runs evals.
A dashboard showing score trends over time.

If the answer is "we test before deploy" without specifics, the eval discipline doesn't exist.
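As a rough illustration of what "evals on every change" can look like, here is a minimal sketch of an eval gate. Everything in it is an assumption for illustration: `EvalCase`, `run_evals`, `gate`, `toy_agent`, the substring check, and the 0.60 baseline are hypothetical names and values, not any agency's actual tooling. Real suites usually score with richer graders than a substring match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # text the output must contain for the case to pass


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for c in cases if c.expected.lower() in agent(c.prompt).lower())
    return passed / len(cases)


def gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the score has not dropped more than `tolerance` below baseline."""
    return score >= baseline - tolerance


# Hypothetical stand-in for the real agent; a CI job would call the actual system.
def toy_agent(prompt: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know")


CASES = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("capital of Peru?", "lima"),
]

score = run_evals(toy_agent, CASES)
print(f"eval score: {score:.2f}, gate passed: {gate(score, baseline=0.60)}")
```

In a CI job, the result of `gate` would typically decide the exit code, so a regression blocks the merge. Storing each run's score is what makes a trend dashboard possible.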

2. "What does your observability stack look like?"

For an AI-touched workflow, observability means per-trace logging of every input, tool call, model decision, and output. They should show you (or describe):

The trace UI they use (Langfuse, Braintrust, or equivalent).
How they alert on regressions.
Cost attribution per trace.

"We have logs" is not observability.
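To make the difference between "logs" and a trace concrete, here is a minimal sketch of a per-trace record with cost attribution. The `Span` and `Trace` classes, the span kinds, and the euro costs are hypothetical and do not reflect the data model of Langfuse, Braintrust, or any other tool; they only show the shape of the information a trace UI is built on.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    kind: str        # "input", "tool_call", "model", or "output"
    name: str
    payload: dict
    cost_eur: float = 0.0
    ts: float = field(default_factory=time.time)


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, kind: str, name: str, payload: dict, cost_eur: float = 0.0) -> None:
        """Append one step of the run to this trace."""
        self.spans.append(Span(kind, name, payload, cost_eur))

    @property
    def total_cost_eur(self) -> float:
        """Cost attributed to this single trace."""
        return round(sum(s.cost_eur for s in self.spans), 6)

    def to_json(self) -> str:
        return json.dumps({**asdict(self), "total_cost_eur": self.total_cost_eur})


# One agent run becomes one trace with ordered spans.
trace = Trace()
trace.record("input", "user_message", {"text": "Refund order 1234"})
trace.record("model", "plan", {"decision": "look up order"}, cost_eur=0.004)
trace.record("tool_call", "orders.lookup", {"order_id": "1234"})
trace.record("output", "reply", {"text": "Refund queued for approval"}, cost_eur=0.003)
print(trace.to_json())
```

Because every step shares a `trace_id`, a single run can be replayed end to end and its cost summed, which flat application logs do not give you.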

3. "Who owns the code and where does it live?"

The right answer: "Your git repository, your cloud, from day one." Variations should make you suspicious:

"We host it on our infra and you get an API." (Vendor lock-in.)
"We push to your repo at end of engagement." (Knowledge stays with them.)
"We use our proprietary framework." (Migration cost when you leave.)

4. "What's your provider abstraction story?"

Modern agents should not be locked to a single LLM vendor. Ask: "If Anthropic doubles its prices tomorrow, how long does it take you to swap to OpenAI?" The right answer is "hours, not weeks", because the agent calls a thin provider interface rather than the vendor SDK directly.
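A thin provider interface can be sketched as below. The names `LLMProvider`, `FakeAnthropic`, `FakeOpenAI`, `get_provider`, and `summarise` are hypothetical, and the two adapters are placeholders that return canned strings. In a real system each adapter would wrap the vendor's SDK, which is deliberately not shown here.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """The only surface the agent code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class FakeAnthropic:
    """Placeholder adapter; a real one would call the vendor SDK here."""

    def complete(self, prompt: str) -> str:
        return f"[anthropic] {prompt}"


class FakeOpenAI:
    """Placeholder adapter for a second vendor."""

    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"


PROVIDERS: dict[str, LLMProvider] = {
    "anthropic": FakeAnthropic(),
    "openai": FakeOpenAI(),
}


def get_provider(name: str) -> LLMProvider:
    """Look up a provider by its configured name."""
    try:
        return PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None


def summarise(text: str, llm: LLMProvider) -> str:
    # Agent logic sees only the interface, never a vendor-specific client.
    return llm.complete(f"Summarise: {text}")


print(summarise("quarterly report", get_provider("anthropic")))
print(summarise("quarterly report", get_provider("openai")))
```

With this shape, switching vendors is a configuration change plus one adapter. Prompts and evals still need re-checking against the new model, so "hours" assumes the eval suite already exists.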

5. "When do you tell clients an AI agent is the wrong answer?"

A serious shop has said no to engagements. They should be able to give you a concrete example of when they recommended a client not build an AI agent. If every problem looks like a nail to them, they're selling hammers.

6. "How do you handle approval gates on irreversible actions?"

For any agent that touches money, sends external communications, signs contracts, or writes to production systems, ask how the approval pattern works. The right answer involves an explicit pending-approval state (a Slack approval step, for example): the agent does 99% of the work, and a human approves the last mile.
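One way to picture the approval pattern is a small state machine in which irreversible actions are parked until a human decides. This is a sketch under assumptions: `Action`, `Status`, `submit`, and `decide` are hypothetical names, and the human decision is passed in as a plain boolean rather than arriving from Slack or any other real approval channel.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    EXECUTED = "executed"
    PENDING_APPROVAL = "pending_approval"
    REJECTED = "rejected"


@dataclass
class Action:
    name: str
    irreversible: bool
    run: Callable[[], str]
    status: Optional[Status] = None
    result: Optional[str] = None


def submit(action: Action) -> Action:
    """Run reversible actions immediately; park irreversible ones for a human."""
    if action.irreversible:
        action.status = Status.PENDING_APPROVAL
    else:
        action.result = action.run()
        action.status = Status.EXECUTED
    return action


def decide(action: Action, approved: bool) -> Action:
    """Apply a human decision to a parked action."""
    if action.status is not Status.PENDING_APPROVAL:
        raise ValueError("action is not awaiting approval")
    if approved:
        action.result = action.run()
        action.status = Status.EXECUTED
    else:
        action.status = Status.REJECTED
    return action


draft = submit(Action("draft_email", irreversible=False, run=lambda: "draft saved"))
refund = submit(Action("issue_refund", irreversible=True, run=lambda: "refund sent"))
print(draft.status.value, refund.status.value)

decide(refund, approved=True)
print(refund.status.value, refund.result)
```

The point to probe with an agency is that the side effect lives only inside `run`, and nothing calls `run` for an irreversible action until the approval arrives.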

7. "What does handover look like?"

Code in your repo is necessary but not sufficient. Ask about:

Runbooks for common failure modes.
Eval suite handover so your team can run them.
On-call rotation patterns.
CI/CD configured for your team's preferences.
Direct contact during a defined warranty period post-launch.

If handover is an afterthought, you'll feel it three months in.

8. "Three references — ideally ones you won't volunteer."

Their first three references are best-case stories. Press for one more that wasn't on the list. Ask the references:

What went wrong during the build?
How did the agency handle scope creep?
Did the system survive after launch?
Would you hire them again?

9. "How do you handle data privacy and residency?"

Look for explicit answers about:

Zero-retention API contracts with LLM providers.
Data residency options (EU-only, in-region).
NDAs and DPAs as standard practice.
Cloud account ownership (your cloud, not theirs).

If the answer is "we're SOC 2 compliant" without specifics about where your data goes, dig deeper.

10. "What's your pricing structure and what's NOT included?"

The right answer is detailed:

Discovery: fixed price (€X).
Build: fixed price after discovery, with explicit scope.
Retainer: monthly with clear scope.
LLM/infra costs: pass-through, you set caps.
Out-of-scope changes: time and materials (T&M), after agreement.

Vague pricing means surprise invoices.

11. "Show me an engineer who'll work on my project."

Not the founder. Not the sales engineer. The person whose Slack messages you'll see day-to-day. Ask them a technical question. A good answer is a strong sign that the team is good. If that person isn't available for a 20-minute call before signing, the team doing the selling isn't the team you'll get.

12. "What will you refuse to do?"

Every serious shop has things they won't do. Ours include: build an agent without evals, skip approval gates on irreversible actions, hide cost structure behind opaque licensing, use a single provider with no fallback.

If a shop will do anything the client wants, they have no opinions. Opinionated shops build better systems.

Red flags

In addition to the questions, watch for:

"We'll figure out the data flows during implementation." No they won't — they'll improvise, badly.
"This will be 99% accurate." Anyone quoting specific accuracy pre-prototype is bluffing.
No engineer on the call. The team that builds isn't the team that sells.
Pressure to sign during the call. Real engagements survive a week of review.
"AI" before "engineering" in their pitch. The hard parts are engineering; the AI is one component.
No working examples. Demos of generic LLM use don't count. Ask for case studies of agents in production.

Green flags

The signals that suggest a serious shop:

They ask more questions about your workflow than you ask them about their tech.
They mention failure modes before successes.
They have opinions about what NOT to do.
They want a paid discovery sprint before any build quote.
They mention evals, observability, and code ownership unprompted.
They've turned down engagements before.

A final filter

Read three pages of the agency's blog. If the content is generic AI puffery, the work probably is too. If it's specific, opinionated, with concrete examples and acknowledged trade-offs, the work probably is too.

For our take on what we will and won't do, see The AI Development playbook. For how we price, see How much does an AI agent cost.

If you want to apply this checklist to us, drop us a note. We'll answer all 12 questions honestly in the first reply.
