Voice & Phone AI Agents

AI receptionists, booking lines, and qualification calls — wired to your calendar, CRM, and ticketing.

What a voice agent actually does

You give it a phone number. It picks up calls. It has a conversation with the caller — listening, speaking, interrupting cleanly, recovering from confusion. It executes the tools you give it (find calendar availability, book a slot, look up an order, qualify a lead, transfer to a human). It hangs up. The transcript and recording land in your system.

That is the whole product. The interesting parts are everything around it (the choice of voice, the pacing, the failure modes, the integrations, the observability), but the surface is exactly that simple.

For a deeper look at how we built one from scratch, see our Twilio + GPT-4o walkthrough post.

When a voice agent is the right call

You should consider a voice agent when:

  1. You are losing calls. After-hours, peak hours, vacation cover, unanswered inbound. Voicemail-to-callback has a brutal drop-off rate.
  2. The conversation has shape. Booking, qualification, order status, FAQ — calls with predictable structure where the human work is mostly transactional.
  3. Volume justifies the build. The floor is roughly 100 calls per month. Below that, a smart IVR or a virtual assistant is cheaper.
  4. You can record the call. Most jurisdictions require you to disclose the recording to the caller. If you cannot record, we cannot evaluate, and the agent will silently drift.

You should not use a voice agent when:

  • The conversation is high-empathy (medical triage, mental health, complaint resolution). Use it as a router to a human, not as the conversation itself.
  • Legal compliance forbids automation (e.g. some debt collection laws). Check first.
  • Your customer base will react badly. Some markets and demographics still strongly prefer humans on the phone. We will tell you if we think your audience won't tolerate it.

Our Voice AI buyer's guide goes deeper on the buy/build decision.

The stack

Layer                             Default choice
Telephony                         Twilio Voice (Stream API)
Realtime model                    OpenAI GPT-4o Realtime
ASR fallback (if non-realtime)    Whisper-large-v3
TTS fallback                      OpenAI TTS / ElevenLabs
Function calling                  OpenAI tools / Anthropic tools
Backend                           Node.js on Cloud Run with WebSocket support
State                             Firestore or Postgres
Recording / transcript            Twilio + your cloud storage
Observability                     Langfuse + structured logs + Slack alerts

We use GPT-4o Realtime as the default because it gives us proper barge-in (the agent stops talking when the caller starts) and sub-second response times, which are the two things that make a call feel like a conversation rather than an interrogation.
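As a rough illustration of the barge-in plumbing, here is a minimal sketch of one piece of it: when the model reports that the caller has started speaking, the bridge asks Twilio to drop any agent audio it has already queued. The event name `input_audio_buffer.speech_started` and the Media Streams `clear` message reflect the public OpenAI and Twilio docs as we understand them; `bargeInMessage`, `twilioSocket`, and the `streamSid` value are hypothetical names for illustration only.

```javascript
// Sketch: translate a model "caller started speaking" event into the
// Twilio Media Streams message that flushes queued agent audio.
// bargeInMessage is a hypothetical helper name, not a library function.
function bargeInMessage(modelEvent, streamSid) {
  if (modelEvent.type !== "input_audio_buffer.speech_started") {
    return null; // not a barge-in signal, nothing to send
  }
  // A "clear" message asks Twilio to discard buffered outbound audio.
  return JSON.stringify({ event: "clear", streamSid });
}

// Usage inside a WebSocket bridge (twilioSocket is assumed to exist):
// const msg = bargeInMessage(event, streamSid);
// if (msg) twilioSocket.send(msg);
```

Flushing Twilio's buffer is only half of the job; the model side also has to stop generating the interrupted response, which depends on how turn detection is configured.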

For longer / slower workflows or when realtime isn't necessary, the Whisper + TTS pattern is cheaper and easier to reason about.
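To make the non-realtime pattern concrete, the sketch below shows one conversational turn as three sequential stages: speech-to-text, a text model, then text-to-speech. The stages are injected as functions so any provider can sit behind them. `handleTurn`, `transcribe`, `respond`, and `synthesize` are hypothetical names, and this is an illustration of the shape rather than a production pipeline.

```javascript
// Sketch of one turn in a non-realtime pipeline. Each stage is an injected
// async function, so the ASR, LLM, and TTS providers stay swappable.
// All names here are illustrative assumptions.
async function handleTurn(audioIn, history, { transcribe, respond, synthesize }) {
  const userText = await transcribe(audioIn); // ASR stage
  const nextHistory = [...history, { role: "user", content: userText }];
  const replyText = await respond(nextHistory); // text model stage
  const audioOut = await synthesize(replyText); // TTS stage
  return {
    audioOut,
    history: [...nextHistory, { role: "assistant", content: replyText }],
  };
}
```

Because the stages run one after another, the latency of a turn is the sum of all three, which is why this pattern suits slower workflows better than fast back-and-forth conversation.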

Anatomy of a production voice agent

A simple booking-line agent looks like this in code, abbreviated:

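// On call connect. `call` is the inbound Twilio call from the enclosing
// handler. `openai.realtime.connect` and the stream helpers below are
// abbreviated wrappers, not raw SDK calls.
const session = await openai.realtime.connect({
  model: "gpt-4o-realtime-preview",
  voice: "alloy",
  instructions: SYSTEM_PROMPT,
  tools: [
    findAvailabilityTool,
    bookAppointmentTool,
    transferToHumanTool,
    logLeadTool,
  ],
});

// Bridge audio both ways between Twilio and the model
twilio.stream.pipe(session.audio.in);
session.audio.out.pipe(twilio.playback);

// Run each tool the model requests and hand the result back.
// The parameter is named toolCall so it does not shadow the Twilio `call`.
session.on("tool_call", async (toolCall) => {
  const result = await TOOLS[toolCall.name].run(toolCall.args);
  await session.tool_result(toolCall.id, result);
});

// On hangup, persist the call record
session.on("end", async (transcript) => {
  await firestore.collection("calls").add({
    callerNumber: call.from,
    durationMs: call.duration,
    transcript,
    audioUrl: await uploadRecording(call.recordingUrl),
    outcome: extractOutcome(transcript),
    crmLogged: await syncToCRM(transcript),
    createdAt: serverTimestamp(),
  });
});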

In production we add: retry logic on tool calls, timeout guardrails, profanity / harassment detection that escalates, structured logs, an eval harness for replaying past calls against prompt changes, and per-call cost attribution.
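The first two items on that list can be sketched generically. The example below bounds each tool call with a timeout and retries failures with a short linear backoff. `withTimeout`, `runToolWithRetry`, and the default values are assumptions chosen for illustration, not the production settings.

```javascript
// Sketch: bound a tool call with a timeout and retry on failure.
// withTimeout and runToolWithRetry are hypothetical helper names.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleaned up.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function runToolWithRetry(
  run,
  args,
  { attempts = 3, timeoutMs = 2000, backoffMs = 100 } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(run(args), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Linear backoff: wait a little longer before each new attempt.
        await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
      }
    }
  }
  throw lastError; // every attempt failed
}
```

Retries are only safe for idempotent tools. A lookup can be repeated freely, but a booking tool needs an idempotency key or a "did it already succeed" check before a second attempt, or the caller may end up with two appointments.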

Process

1. Discovery — 3 to 5 days

We map the call surface area: typical intents, edge cases, current handoff to humans, success metric (booking rate, qualified-lead rate, resolved-without-escalation rate). We pull 20+ historic call recordings or transcripts where available and analyse the shape of real conversations. We propose the intent taxonomy.

2. Script + prompt design — 3 to 5 days

We write the system prompt, the tool descriptions, the opening message, and the escalation triggers. We define the voice (which TTS voice, pacing, persona). We define what is in scope and what is not, so the agent doesn't attempt things it shouldn't handle.
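For a sense of what a tool description involves, here is a hedged sketch of a single tool entry: a JSON Schema definition the model reads, plus a local `run()` implementation. The flat `type: "function"` layout follows the OpenAI Realtime session format as we understand it; the tool name, description wording, parameter, and stubbed slots are all made up for illustration.

```javascript
// Sketch of a tool entry: a schema for the model plus a local run().
// Every name and field value here is illustrative.
const findAvailabilityTool = {
  definition: {
    type: "function",
    name: "find_availability",
    // The description doubles as scoping: it says when NOT to call the tool.
    description:
      "Look up open appointment slots for a given date. " +
      "Use only after the caller has named a date.",
    parameters: {
      type: "object",
      properties: {
        date: { type: "string", description: "ISO date, e.g. 2025-03-14" },
      },
      required: ["date"],
    },
  },
  // Stubbed result: a real implementation would query the calendar system.
  async run({ date }) {
    return { date, slots: ["09:00", "14:30"] };
  },
};

// Registry keyed by tool name, so a tool call can be dispatched by name.
const TOOLS = { [findAvailabilityTool.definition.name]: findAvailabilityTool };
```

Most of the design effort goes into the `description` strings, since they are the only thing telling the model when a tool is appropriate.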

3. Build — 1 to 3 weeks

Twilio number provisioning. WebSocket bridge to the model. Tool implementations. CRM integration. Recording / transcript pipeline. Observability dashboard. Test calls every day.
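The entry point of the WebSocket bridge is usually a small TwiML response. As a minimal sketch, the function below builds the TwiML that asks Twilio to connect an inbound call to a bidirectional media stream. The `<Connect>` and `<Stream>` verbs are Twilio's; `connectStreamTwiml`, `escapeXml`, and the example URL are hypothetical.

```javascript
// Sketch: build the TwiML that connects a call to a WebSocket media stream.
// connectStreamTwiml and escapeXml are hypothetical helper names.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function connectStreamTwiml(websocketUrl) {
  // Twilio media streams only accept secure WebSocket endpoints.
  if (!websocketUrl.startsWith("wss://")) {
    throw new Error("Twilio media streams require a wss:// URL");
  }
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    "<Response><Connect>" +
    `<Stream url="${escapeXml(websocketUrl)}" />` +
    "</Connect></Response>"
  );
}

// Usage: return this string from the webhook Twilio calls on inbound calls.
// connectStreamTwiml("wss://example.com/media");
```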

4. Eval + tuning — 1 week

We run the agent against your real historic call set (or simulated calls) and score it. We tune the prompt, the voice, the tools. We re-run. We do not deploy until the eval shows the agent meets the success metric.
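The deploy gate itself reduces to simple arithmetic. As a sketch, assuming each replayed call yields an expected and an actual outcome, the function below computes the pass rate and compares it with the agreed target. `evalGate` and the `{ expected, actual }` record shape are assumptions for illustration; a real harness would score more than exact outcome matches.

```javascript
// Sketch: score replayed calls and gate deployment on a target rate.
// The { expected, actual } shape and evalGate itself are assumptions.
function evalGate(results, targetRate) {
  if (results.length === 0) throw new Error("no eval calls to score");
  const passed = results.filter((r) => r.actual === r.expected).length;
  const rate = passed / results.length;
  return { passed, total: results.length, rate, deploy: rate >= targetRate };
}

// Example: 3 of 4 replayed calls match, against a hypothetical 0.8 target.
const report = evalGate(
  [
    { expected: "booked", actual: "booked" },
    { expected: "booked", actual: "transferred" },
    { expected: "transferred", actual: "transferred" },
    { expected: "qualified", actual: "qualified" },
  ],
  0.8
);
// report.rate is 0.75, so report.deploy is false
```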

5. Launch — phased

Single phone number → 10% of inbound traffic → 50% → 100%. Or a separate after-hours line first, with humans during business hours. Whatever lets us catch regressions early.
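One common way to implement a percentage split, sketched here under the assumption that routing is keyed on the caller's number: hash the number into a stable bucket from 0 to 99 and send the call to the agent when the bucket falls below the current rollout percentage. `rolloutBucket` and `routeToAgent` are hypothetical names.

```javascript
const crypto = require("node:crypto");

// Sketch: deterministic percentage rollout keyed on caller number, so the
// same caller always lands on the same side of the split.
function rolloutBucket(callerNumber) {
  const digest = crypto.createHash("sha256").update(callerNumber).digest();
  return digest.readUInt32BE(0) % 100; // stable bucket in [0, 99]
}

function routeToAgent(callerNumber, rolloutPercent) {
  return rolloutBucket(callerNumber) < rolloutPercent;
}

// Usage: routeToAgent("+15550100", 10) during the 10% phase.
```

Because buckets are stable, raising the percentage from 10 to 50 only adds callers to the agent group; nobody who already reached the agent is moved back to the human line mid-rollout.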

6. Iterate

Voice agents drift more than text agents because reality keeps changing the conversation. We keep a monthly eval cadence on retainer.

What good observability looks like for voice

A one-glance dashboard:

  • Calls per day, week, month
  • Outcome distribution (booked / qualified / transferred / failed)
  • Average call duration
  • Cost per call (Twilio + model + downstream tools)
  • Caller sentiment (sampled from transcripts)
  • Escalation rate over time (trending up = something to fix)
  • Latency p50 / p95 (time to first agent word, time-to-resolution)

Plus per-call drilldown with the full transcript, the audio playback, the tool calls made, and the outcome.
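Several of these dashboard numbers can be derived directly from stored call records. The sketch below computes nearest-rank latency percentiles, the outcome distribution, and the escalation rate. The record fields `firstWordMs` and `outcome`, and the function names, are assumed for illustration.

```javascript
// Sketch: nearest-rank percentiles and outcome counts over call records.
// The record fields (firstWordMs, outcome) are assumed names.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(calls) {
  const latencies = calls.map((c) => c.firstWordMs);
  const outcomes = {};
  for (const c of calls) outcomes[c.outcome] = (outcomes[c.outcome] || 0) + 1;
  const escalated = outcomes.transferred || 0;
  return {
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    outcomes,
    escalationRate: calls.length ? escalated / calls.length : 0,
  };
}
```

With small samples the p95 is effectively the slowest call, so it is worth reading alongside the call count for the same period.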

Pricing — the honest version

Engagement                               Scope         Investment
Discovery + scripted demo                5–7 days      €3,500–6,000
Single-intent booking line               3–4 weeks     €15,000–25,000
Multi-intent qualification line + CRM    4–6 weeks     €25,000–50,000
Multi-language or multi-line program     6–10 weeks    €50,000–100,000
Ongoing retainer (eval + tuning)         Monthly       from €1,500/month

Plus pass-through Twilio + LLM costs (typically €0.10–€0.40 per call).
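To turn the per-call range into a monthly figure, multiply it by expected call volume. A small sketch, using only the €0.10 to €0.40 range quoted above and a call volume you supply; `monthlyPassThroughEur` is a hypothetical helper, and actual costs depend on call length and tool usage.

```javascript
// Sketch: monthly pass-through estimate from the quoted per-call range.
// Works in integer cents to avoid floating-point drift.
const PER_CALL_CENTS = { low: 10, high: 40 }; // EUR 0.10 to EUR 0.40 per call

function monthlyPassThroughEur(callsPerMonth) {
  return {
    low: (callsPerMonth * PER_CALL_CENTS.low) / 100,
    high: (callsPerMonth * PER_CALL_CENTS.high) / 100,
  };
}

// Example: 1,000 calls per month gives { low: 100, high: 400 } in EUR.
const estimate = monthlyPassThroughEur(1000);
```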

What we will not do

  • Build a voice agent without recording (we cannot evaluate, and the agent will silently degrade).
  • Build a voice agent for legally sensitive workflows without explicit legal sign-off from your team.
  • Promise specific resolution rates before the eval phase is done.
  • Ship without warm-transfer paths to humans.

If you have a call surface in mind — after-hours bookings, qualification, FAQ — send a note and we'll come back within one business day with a feasibility take.

Ready to scope voice & phone AI agents?

A discovery call is the fastest way to know if there's a fit.