Voice & Phone Agent
After-hours bookings, lead qualification, customer service overflow, FAQ lines
What a voice agent actually does
You give it a phone number. It picks up. It listens, speaks, handles interruptions cleanly, and recovers from confusion. It executes the tools you give it. It hangs up.
That's the whole surface. The interesting parts are everything around it — the model choice, the voice, the pacing, the tool implementations, the observability, the escalation paths — but the user-facing product is exactly that simple.
A typical 90-second booking call goes:
- Agent: "Thanks for calling [practice]. You're speaking with an AI assistant — I can help you book, reschedule, or get you to a human if you prefer. How can I help?"
- Caller: "I need to book a check-up for next Tuesday."
- Agent: [calls findAvailability(2026-05-26, 30)] "I have 10am, 11:30am, or 2pm. Which works?"
- Caller: "10am please."
- Agent: "Can I have your name and date of birth?"
- Caller: "[gives details]"
- Agent: [calls bookAppointment(slotId, patient)] "Booked. You'll get a confirmation by SMS. Anything else?"
End: transcript and recording in Firestore, CRM updated, SMS sent, call counted in the dashboard.
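The two tool calls in that transcript can be sketched as plain functions. This is a minimal illustration, assuming an in-memory calendar: the `Slot`, `Patient`, and `Booking` shapes, the slot IDs, and the failure reasons are hypothetical, and a real build would query the practice's booking system instead.

```typescript
// Hypothetical data shapes; a real build would map these to the booking system's records.
interface Slot {
  slotId: string;
  start: string; // local ISO 8601 timestamp, e.g. "2026-05-26T10:00"
  durationMinutes: number;
  booked: boolean;
}

interface Patient {
  name: string;
  dateOfBirth: string; // "YYYY-MM-DD"
}

interface Booking {
  slotId: string;
  patient: Patient;
}

type BookingResult =
  | { ok: true; booking: Booking }
  | { ok: false; reason: string };

// Hypothetical in-memory calendar standing in for the real one.
const slots: Slot[] = [
  { slotId: "s1", start: "2026-05-26T10:00", durationMinutes: 30, booked: false },
  { slotId: "s2", start: "2026-05-26T11:30", durationMinutes: 30, booked: false },
  { slotId: "s3", start: "2026-05-26T14:00", durationMinutes: 30, booked: false },
  { slotId: "s4", start: "2026-05-27T09:00", durationMinutes: 30, booked: false },
];

const bookings: Booking[] = [];

// Returns free slots on the given date that are long enough for the visit.
function findAvailability(date: string, durationMinutes: number): Slot[] {
  return slots.filter(
    (s) =>
      !s.booked &&
      s.start.slice(0, 10) === date &&
      s.durationMinutes >= durationMinutes
  );
}

// Books a slot, or reports why it could not, so the agent can offer an alternative.
function bookAppointment(slotId: string, patient: Patient): BookingResult {
  const slot = slots.filter((s) => s.slotId === slotId)[0];
  if (!slot) return { ok: false, reason: "unknown_slot" };
  if (slot.booked) return { ok: false, reason: "slot_taken" };
  slot.booked = true;
  const booking: Booking = { slotId: slotId, patient: patient };
  bookings.push(booking);
  return { ok: true, booking: booking };
}
```

Returning a structured failure rather than throwing lets the model say "that slot just went, I also have 11:30" instead of stalling.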
When this agent is the right call
You should consider a voice agent when:
- You are losing calls. After-hours, peak hours, vacation cover, unanswered inbound. Voicemail-to-callback has brutal drop-off.
- The conversation has shape. Booking, qualification, order status, FAQ — predictable structure where the human work is mostly transactional.
- Volume justifies it. ~100+ relevant calls per month at the bottom end.
- You can record. Recording is non-negotiable for evaluation; if regulation forbids it, you cannot review calls, and the agent's quality will drift without anyone noticing.
You should not use a voice agent when:
- The work is high-empathy (medical triage, mental health, complaint resolution). Use the agent as a router to a human, not as the conversation itself.
- Compliance forbids automation (some debt collection laws, some healthcare scenarios).
- Your audience hates phone bots categorically. (More common in some markets than others.)
Our Voice AI buyer's guide covers the decision in detail.
The stack
| Layer | Default |
|---|---|
| Telephony | Twilio Voice (Stream API for realtime audio) |
| Realtime model | OpenAI GPT-4o Realtime |
| Fallback ASR | Whisper-large-v3 |
| Fallback TTS | OpenAI TTS / ElevenLabs |
| Function calling | OpenAI tools |
| Backend | Node.js on Cloud Run with WebSocket |
| State | Firestore or Postgres |
| Recording | Twilio + your cloud storage |
| Observability | Langfuse + structured logs + Slack alerts |
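The first hop in this stack is the telephony webhook handing the call's audio to the backend. As a rough sketch: when a call arrives, the webhook returns TwiML whose `<Connect><Stream>` element tells Twilio to open a bidirectional audio stream to a WebSocket URL. The helper names and the bridge URL below are hypothetical; only the TwiML elements come from Twilio.

```typescript
// Escapes characters that are not allowed inside an XML attribute value.
function escapeXmlAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds the TwiML body for the incoming-call webhook.
// <Connect><Stream> asks Twilio to stream call audio to the given WebSocket URL.
function buildStreamTwiml(bridgeUrl: string): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    '    <Stream url="' + escapeXmlAttr(bridgeUrl) + '" />',
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// Hypothetical bridge endpoint; in this stack it would be the Cloud Run service.
const twiml = buildStreamTwiml("wss://voice-bridge.example.com/media");
```

The webhook would send this string back with an XML content type; the Node.js service behind the URL then relays audio between Twilio and the realtime model.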
Anatomy of a production voice agent
The minimum production voice agent has:
- System prompt that defines persona, scope, escalation triggers, and the voice's "rules of engagement."
- Tool definitions that map intents to concrete actions in your systems.
- Function-calling loop that lets the model invoke tools mid-conversation.
- Recording + transcription for every call, stored in your cloud.
- Eval harness that replays past calls against new prompts/models.
- Observability dashboard showing per-call cost, latency, outcomes, and trends.
- Escalation paths — warm transfer during business hours, callback queue outside.
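The function-calling loop in that list comes down to a dispatcher: the model emits a tool name plus JSON-encoded arguments, your code runs the matching handler, and the result goes back to the model as text. Below is a minimal sketch under stated assumptions: the handler bodies are stubs, the error codes are hypothetical, and the handlers are synchronous for brevity where real ones would be asynchronous calls into your systems.

```typescript
type ToolArgs = { [key: string]: unknown };
type ToolHandler = (args: ToolArgs) => unknown;

interface ToolCall {
  name: string;
  arguments: string; // JSON-encoded arguments, as the model produces them
}

// Hypothetical stub handlers; real ones would hit the calendar and CRM.
const handlers: { [toolName: string]: ToolHandler } = {
  findAvailability: (args) => ({ date: args["date"], slots: ["10:00", "11:30", "14:00"] }),
  bookAppointment: (args) => ({ confirmed: true, slotId: args["slotId"] }),
};

// Runs one tool call and always returns a JSON string the model can read,
// including for unknown tools, malformed arguments, and handler failures.
function dispatchToolCall(call: ToolCall): string {
  // Own-property check so names like "toString" are not treated as tools.
  const handler = Object.prototype.hasOwnProperty.call(handlers, call.name)
    ? handlers[call.name]
    : undefined;
  if (!handler) {
    return JSON.stringify({ error: "unknown_tool", tool: call.name });
  }

  let args: ToolArgs;
  try {
    const parsed: unknown = JSON.parse(call.arguments);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return JSON.stringify({ error: "invalid_arguments", tool: call.name });
    }
    args = parsed as ToolArgs;
  } catch (err) {
    return JSON.stringify({ error: "invalid_arguments", tool: call.name });
  }

  try {
    return JSON.stringify({ result: handler(args) });
  } catch (err) {
    return JSON.stringify({ error: "tool_failed", tool: call.name });
  }
}
```

Never letting an exception escape the dispatcher matters on a phone line: a structured error lets the agent apologise and escalate, while an unhandled one is dead air.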
A reference implementation in code lives in our Twilio + GPT-4o walkthrough post.
Cost economics
Roughly €0.10–€0.40 per call in production:
- Twilio voice: ~€0.015/minute
- GPT-4o Realtime: ~€0.10 per minute of conversation (covers audio in both directions)
- Tool calls (added latency and any per-request API fees): variable
- Storage and observability: trivial
A typical 90-second booking call: ~€0.12. A 5-minute support call: ~€0.40. Surface this on the dashboard so you can correlate spend to outcomes.
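To put per-call cost on a dashboard, a small estimator is enough. The sketch below simply multiplies call duration by the per-minute rates listed above; the function names, the optional `toolCostEur` parameter, and the linear billing of fractional minutes are assumptions. A straight multiplication of those rates gives about €0.17 for a 90-second call and just under €0.60 for five minutes, somewhat above the rounded per-call figures, so treat the rates as parameters to calibrate against real invoices.

```typescript
interface Rates {
  telephonyPerMinute: number; // EUR per minute
  modelPerMinute: number; // EUR per minute
}

// Per-minute rates as listed in this section.
const listedRates: Rates = { telephonyPerMinute: 0.015, modelPerMinute: 0.1 };

// Linear estimate: duration in minutes times the combined per-minute rate,
// plus any per-call tool fees. Fractional minutes are billed proportionally,
// which is a simplification.
function estimateCallCost(
  durationSeconds: number,
  rates: Rates,
  toolCostEur: number = 0
): number {
  const minutes = durationSeconds / 60;
  return minutes * (rates.telephonyPerMinute + rates.modelPerMinute) + toolCostEur;
}

// Spend divided by successful outcomes (e.g. bookings made).
// Returns null when there are no outcomes, rather than dividing by zero.
function costPerOutcome(totalSpendEur: number, outcomes: number): number | null {
  return outcomes > 0 ? totalSpendEur / outcomes : null;
}
```

`costPerOutcome` is the number worth watching: cost per completed booking says more than cost per call.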
Timeline
| Scope | Duration |
|---|---|
| Single-intent booking line | 3–4 weeks |
| Multi-intent qualification + CRM | 4–6 weeks |
| Multi-language or multi-line | 6–10 weeks |
| Enterprise voice platform (many lines, many integrations) | 10–16 weeks |
Common failure modes we've seen
- The voice sounds wrong for the brand. Default voices are too neutral or too enthusiastic. We tune voice + pacing during the build, not after deployment.
- The agent answers questions it shouldn't. For example, a booking agent is suddenly asked for medical advice. We define scope explicitly and add refusal patterns.
- Cold transfer to a human, who restarts from scratch. We always pass conversation context with the transfer.
- No recording. Cannot evaluate. Cannot tune. Cannot debug. The agent silently degrades over months. We refuse engagements without recording.
- Vendor lock-in to Twilio's "Studio" or similar low-code IVR builders. Easy to start, painful to scale. We default to code from day one.
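Passing context with a transfer can be as simple as a handoff record built at escalation time. The following is a sketch, not a fixed design: the `Handoff` shape, the Monday-to-Friday 09:00 to 17:00 business hours, and the choice of the last four turns as the summary are all assumptions to adjust.

```typescript
interface Turn {
  speaker: "agent" | "caller";
  text: string;
}

interface Handoff {
  route: "warm_transfer" | "callback_queue";
  reason: string;
  callerNumber: string;
  summary: string; // recent turns, shown to the human before they pick up
}

// Hypothetical business hours: Monday to Friday, 09:00 to 17:00 local time.
function isBusinessHours(now: Date): boolean {
  const day = now.getDay(); // 0 = Sunday, 6 = Saturday
  const hour = now.getHours();
  return day >= 1 && day <= 5 && hour >= 9 && hour < 17;
}

// Builds the record handed to a human: where the call goes, why it escalated,
// and the last few turns so nobody has to restart from scratch.
function buildHandoff(
  callerNumber: string,
  reason: string,
  transcript: Turn[],
  now: Date
): Handoff {
  const recentTurns = transcript.slice(-4);
  const summary = recentTurns
    .map((turn) => turn.speaker + ": " + turn.text)
    .join("\n");
  return {
    route: isBusinessHours(now) ? "warm_transfer" : "callback_queue",
    reason: reason,
    callerNumber: callerNumber,
    summary: summary,
  };
}
```

The same record works for both routes: during business hours it is shown to the person receiving the warm transfer, and outside them it becomes the callback ticket.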
Where it pairs
Voice agents commonly chain with:
- Conversational agents for the same knowledge surface served in chat.
- Workflow orchestrators that pick up after the call — send follow-up emails, schedule reminders, kick off downstream automations.
- Document processing agents when the caller references documents that need to be retrieved or validated mid-call.
See Voice Concierge for a full end-to-end build, or drop us a note with the call surface you'd like to automate.
Related
Voice & Phone AI Agents
AI receptionists, booking lines, and qualification calls — wired to your calendar, CRM, and ticketing.
Voice Concierge
AI phone agent for after-hours bookings
Building a phone agent with Twilio + GPT-4o: a complete walkthrough
Build a phone agent: Twilio provisions the number and streams audio, a Node.js bridge on Cloud Run pipes the audio to GPT-4o Realtime, function-calling tools execute real actions (book appointment, log lead, transfer). Recording, transcript, and observability on every call. Production deployment in 3-6 weeks.
Voice AI for service businesses: a buyer's guide
Voice AI works for service businesses with predictable call patterns and meaningful inbound volume. Booking, qualification, status, FAQ. Real cost ~€0.10-0.40/call. Real build cost €15-50k for a single-line deployment. Evaluate vendors on recording, escalation paths, and CRM integration — not on the demo.
Why your AI chatbot fails (and what to fix)
Most chatbots that fail in production fail for one of six reasons: no retrieval, bad retrieval, no evals, no escalation, no observability, no scope. Tuning the prompt won't fix any of them. The fix is engineering — and the engineering is well-understood by now.