Hiring an AI development agency: 12 questions to ask

Updated May 21, 2026 · 5 min read

The 12 questions

Before you sign with any AI development agency in 2026, get clear answers to these. Vague or evasive answers are themselves a signal.

1. "What's in your eval suite, and can I see how scores trend?"

A serious agency runs evals on every prompt and model change. They should be able to show you:

  • A sample eval set from a previous engagement (anonymised).
  • The CI integration that runs evals.
  • A dashboard showing score trends over time.

If the answer is "we test before deploy" without specifics, the eval discipline doesn't exist.
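To make "evals on every change" concrete, here is a minimal sketch of the kind of gate a CI job might run. Everything in it is an illustrative assumption rather than any particular agency's tooling: the `EvalCase` type, the `run_evals` and `gate` functions, exact-match scoring, and the 0.02 tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the model output matches exactly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    return passed / len(cases)


def gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass only if the score is no more than `tolerance` below the baseline."""
    return score >= baseline - tolerance


# Stand-in for a real model call (assumption: a plain function from prompt to text).
def fake_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("capital of Spain", "Madrid"),
]

score = run_evals(fake_model, CASES)  # two of three cases pass
```

In a real pipeline the CI job would fail the build when `gate(score, baseline)` returns `False`, and each score would be stored so the trend can be plotted. Real eval suites usually need richer scoring than exact match; ask the agency what theirs uses.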

2. "What does your observability stack look like?"

For an AI-touched workflow, observability means per-trace logging — every input, tool call, model decision, output. They should show you (or describe):

  • The trace UI they use (Langfuse, Braintrust, or equivalent).
  • How they alert on regressions.
  • Cost attribution per trace.
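As a rough picture of what per-trace logging with cost attribution could look like, here is a minimal sketch. The `Trace` and `Span` types, the field names, the workflow name, and the cost figures are hypothetical; tools such as Langfuse or Braintrust have their own data models and SDKs, which this does not attempt to reproduce.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    kind: str  # "input", "tool_call", "model_decision", or "output"
    payload: dict
    cost_usd: float = 0.0
    ts: float = field(default_factory=time.time)


@dataclass
class Trace:
    workflow: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, kind: str, payload: dict, cost_usd: float = 0.0) -> None:
        """Append one step of the run to the trace."""
        self.spans.append(Span(kind, payload, cost_usd))

    @property
    def total_cost_usd(self) -> float:
        """Cost attributed to this single trace."""
        return round(sum(s.cost_usd for s in self.spans), 6)

    def to_json(self) -> str:
        """Serialise the whole trace, including its total cost."""
        return json.dumps({**asdict(self), "total_cost_usd": self.total_cost_usd})


# Hypothetical run of an invoice-handling agent.
trace = Trace(workflow="invoice-triage")
trace.record("input", {"text": "Invoice overdue"})
trace.record("model_decision", {"action": "lookup_customer"}, cost_usd=0.0021)
trace.record("tool_call", {"tool": "crm.lookup"})
trace.record("output", {"reply": "Reminder drafted"}, cost_usd=0.0034)
```

The point to check with an agency is that every step of a run hangs off one trace ID, so a bad output can be followed back to the input, tool call, or model decision that caused it.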

"We have logs" is not observability.

3. "Who owns the code and where does it live?"

The right answer: "Your git repository, your cloud, from day one." Variations should make you suspicious:

  • "We host it on our infra and you get an API." (Vendor lock-in.)
  • "We push to your repo at end of engagement." (Knowledge stays with them.)
  • "We use our proprietary framework." (Migration cost when you leave.)

4. "What's your provider abstraction story?"

Modern agents should not be locked to a single LLM vendor. Ask: "If Anthropic doubles their prices tomorrow, how long does it take you to swap to OpenAI?" The right answer is "hours, not weeks" — because the agent calls a thin provider interface, not the vendor SDK directly.
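A thin provider interface can be sketched in a few lines. This is an assumption-laden illustration, not a real integration: `ChatProvider`, `EchoProvider`, `Agent`, and the `vendor_a`/`vendor_b` names are hypothetical, and the stand-in adapter returns canned text where a real adapter would call the vendor SDK.

```python
from typing import Protocol


class ChatProvider(Protocol):
    """The only surface the agent is allowed to depend on."""

    def complete(self, system: str, user: str) -> str: ...


class EchoProvider:
    """Stand-in adapter; a real one would wrap a vendor SDK call here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, system: str, user: str) -> str:
        return f"[{self.name}] {user}"


class Agent:
    """Agent logic sees ChatProvider only, never a vendor SDK."""

    def __init__(self, provider: ChatProvider) -> None:
        self.provider = provider

    def answer(self, question: str) -> str:
        return self.provider.complete("You are a helpful assistant.", question)


# Swapping vendors is a configuration change, not a rewrite.
PROVIDERS: dict[str, ChatProvider] = {
    "vendor_a": EchoProvider("vendor_a"),
    "vendor_b": EchoProvider("vendor_b"),
}


def build_agent(provider_name: str) -> Agent:
    return Agent(PROVIDERS[provider_name])
```

With this shape, moving from one vendor to another means writing one new adapter and changing the name passed to `build_agent`. The prompts and evals still need re-running against the new model, which is where most of the "hours" go.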

5. "When do you tell clients an AI agent is the wrong answer?"

A serious shop has said no to engagements. They should be able to give you a concrete example of when they recommended a client not build an AI agent. If every problem looks like a nail to them, they're selling hammers.

6. "How do you handle approval gates on irreversible actions?"

For any agent that touches money, sends external communications, signs contracts, or performs write operations on production systems, ask how the approval pattern works. The right answer involves a Slack-approval state or similar: the agent does 99% of the work, and a human approves the last mile.
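One way to picture an approval gate is as a small state machine in which the side effect is deferred until a human decides. The sketch below is hypothetical throughout (`ProposedAction`, `submit`, `decide`, and the refund example are invented for illustration), and it leaves out the notification step, which could be a Slack message or anything similar.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PENDING_APPROVAL = "pending_approval"
    EXECUTED = "executed"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    description: str
    execute: Callable[[], str]  # the side effect, deferred until allowed
    irreversible: bool
    status: Status = Status.PENDING_APPROVAL
    result: Optional[str] = None
    decided_by: Optional[str] = None


def submit(action: ProposedAction) -> ProposedAction:
    """Run reversible actions at once; park irreversible ones for a human."""
    if not action.irreversible:
        action.result = action.execute()
        action.status = Status.EXECUTED
    return action


def decide(action: ProposedAction, approved: bool, approver: str) -> ProposedAction:
    """Apply a human decision to an action that is waiting for approval."""
    if action.status is not Status.PENDING_APPROVAL:
        raise ValueError("action is not awaiting approval")
    action.decided_by = approver
    if approved:
        action.result = action.execute()
        action.status = Status.EXECUTED
    else:
        action.status = Status.REJECTED
    return action


sent: list[str] = []  # records side effects so the example can be inspected
refund = ProposedAction(
    "Refund 120 EUR",
    lambda: sent.append("refund") or "refunded",
    irreversible=True,
)
submit(refund)  # nothing has been executed yet; the action waits for a human
```

Calling `decide(refund, approved=True, approver="alice")` is what finally runs the refund. A useful follow-up question for an agency is where the pending state is persisted, since an approval that lives only in memory is lost on restart.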

7. "What does handover look like?"

Code in your repo is necessary but not sufficient. Ask about:

  • Runbooks for common failure modes.
  • Eval suite handover so your team can run them.
  • On-call rotation patterns.
  • CI/CD configured for your team's preferences.
  • Direct contact during a defined warranty period post-launch.

If handover is an afterthought, you'll feel it three months in.

8. "Three references — ideally ones you won't volunteer."

Their first three references are best-case stories. Press for one more that wasn't on the list. Ask the references:

  • What went wrong during the build?
  • How did the agency handle scope creep?
  • Did the system survive after launch?
  • Would you hire them again?

9. "How do you handle data privacy and residency?"

Look for explicit answers about:

  • Zero-retention API contracts with LLM providers.
  • Data residency options (EU-only, in-region).
  • NDAs and DPAs as standard practice.
  • Cloud account ownership (your cloud, not theirs).

If the answer is "we're SOC 2 compliant" without specifics about where your data goes, dig deeper.

10. "What's your pricing structure and what's NOT included?"

The right answer is detailed:

  • Discovery: fixed price (€X).
  • Build: fixed price after discovery, with explicit scope.
  • Retainer: monthly with clear scope.
  • LLM/infra costs: pass-through, you set caps.
  • Out-of-scope changes: T&M after agreement.

Vague pricing means surprise invoices.
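"You set caps" on pass-through LLM costs can be enforced in code as well as in the contract. Here is a minimal sketch of a hard spending cap; the `MonthlyBudget` class, its method names, and the euro amounts are hypothetical, and a production version would persist spend and reset it each billing period.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the agreed cap."""


class MonthlyBudget:
    def __init__(self, cap_eur: float) -> None:
        self.cap_eur = cap_eur
        self.spent_eur = 0.0

    def charge(self, amount_eur: float) -> None:
        """Record spend, refusing any charge that would exceed the cap."""
        if amount_eur < 0:
            raise ValueError("amount must be non-negative")
        if self.spent_eur + amount_eur > self.cap_eur:
            raise BudgetExceeded(
                f"charge of {amount_eur} would exceed cap of {self.cap_eur}"
            )
        self.spent_eur += amount_eur

    @property
    def remaining_eur(self) -> float:
        return self.cap_eur - self.spent_eur


budget = MonthlyBudget(cap_eur=100.0)
budget.charge(60.0)
budget.charge(30.0)
```

A further `budget.charge(20.0)` would raise `BudgetExceeded` instead of silently running up the bill. Ask what happens to in-flight work when the cap is hit.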

11. "Show me an engineer who'll work on my project."

Not the founder. Not the sales engineer. The person whose Slack messages you'll see day-to-day. Ask them a technical question. A good answer is a good sign for the rest of the team. If that person isn't available for a 20-minute call before signing, the people selling to you are probably not the people who will build for you.

12. "What will you refuse to do?"

Every serious shop has things they won't do. Ours include building an agent without evals, skipping approval gates on irreversible actions, hiding cost structure behind opaque licensing, and using a single provider with no fallback.

If a shop will do anything the client wants, they have no opinions. Opinionated shops build better systems.

Red flags

In addition to the questions, watch for:

  • "We'll figure out the data flows during implementation." No they won't — they'll improvise, badly.
  • "This will be 99% accurate." Anyone quoting specific accuracy pre-prototype is bluffing.
  • No engineer on the call. The team that builds isn't the team that sells.
  • Pressure to sign during the call. Real engagements survive a week of review.
  • "AI" before "engineering" in their pitch. The hard parts are engineering; the AI is one component.
  • No working examples. Demos of generic LLM use don't count. Ask for case studies of agents in production.

Green flags

The signals that suggest a serious shop:

  • They ask more questions about your workflow than you ask them about their tech.
  • They mention failure modes before successes.
  • They have opinions about what NOT to do.
  • They want a paid discovery sprint before any build quote.
  • They mention evals, observability, and code ownership unprompted.
  • They've turned down engagements before.

A final filter

Read three pages of the agency's blog. If the content is generic AI puffery, the work probably is too. If it's specific, opinionated, with concrete examples and acknowledged trade-offs, the work probably is too.

For our take on what we will and won't do, see The AI Development playbook. For how we price, see How much does an AI agent cost.

If you want to apply this checklist to us, drop us a note. We'll answer all 12 questions honestly in the first reply.
