Building a phone agent with Twilio + GPT-4o: a complete walkthrough
Architecture overview
The pipeline:
[Caller dials Twilio number]
↓
[Twilio answers, opens audio stream via WebSocket to your service]
↓
[Node.js bridge on Cloud Run]
├── Receives Twilio audio chunks
├── Maintains WebSocket to OpenAI GPT-4o Realtime
├── Pipes audio in both directions
├── Routes function calls to your tools
└── Logs everything to Firestore + Langfuse
↓
[GPT-4o Realtime model]
├── Listens to caller audio
├── Generates response audio (TTS-equivalent in one model)
├── Decides when to call tools
└── Decides when to end the call
↓
[Your tools]
├── findAvailability(date, durationMinutes)
├── bookAppointment(slotId, customerInfo)
├── transferToHuman(reason)
├── logLead(name, contact, qualifyingInfo)
└── logCallOutcome(outcome, notes)
↓
[Side effects]
├── Calendar event created
├── CRM record written
├── SMS confirmation sent
├── Slack alert if escalation
└── Recording + transcript stored
Three production-critical details: the WebSocket bridge handles back-pressure correctly; tools are idempotent; every call is recorded.
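The back-pressure point can be sketched as follows. This is a minimal illustration, not the bridge's actual code: it assumes the bridge can read how many bytes are still queued on the outbound socket (the `ws` library exposes this as `bufferedAmount`) and prefers dropping the oldest audio frames over letting delay build up. `AudioFrameQueue` and its limits are hypothetical names.

```typescript
// Hypothetical bounded queue for outbound audio frames (base64 strings).
// When the consumer falls behind, the oldest frames are dropped so the
// caller hears a short gap instead of ever-growing delay.
class AudioFrameQueue {
  private frames: string[] = [];
  private dropped = 0;
  private readonly maxFrames: number;

  constructor(maxFrames: number) {
    this.maxFrames = maxFrames;
  }

  push(frame: string): void {
    this.frames.push(frame);
    while (this.frames.length > this.maxFrames) {
      this.frames.shift(); // discard the oldest frame
      this.dropped += 1;
    }
  }

  // Send queued frames while the socket reports room in its buffer.
  // `bufferedAmount` is a callback so it is re-checked before every frame.
  drain(
    send: (frame: string) => void,
    bufferedAmount: () => number,
    maxBufferedBytes: number,
  ): number {
    let sent = 0;
    while (this.frames.length > 0 && bufferedAmount() < maxBufferedBytes) {
      send(this.frames.shift() as string);
      sent += 1;
    }
    return sent;
  }

  get size(): number {
    return this.frames.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}
```

In a bridge you would call `push` for each audio delta from the model and `drain` with `() => twilioWs.bufferedAmount`; the dropped count is worth logging per call.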
Twilio side
Provision a phone number and configure its incoming-call webhook to POST to your service. With the Next.js route file below, the path is /api/voice/incoming.
In app/api/voice/incoming/route.ts:
import twilio from "twilio";

export async function POST(req: Request) {
  const VoiceResponse = twilio.twiml.VoiceResponse;
  const twiml = new VoiceResponse();
  // <Connect><Stream> asks Twilio to open a bidirectional audio WebSocket
  const connect = twiml.connect();
  connect.stream({
    url: `wss://${process.env.HOST}/voice/stream`,
  });
  return new Response(twiml.toString(), {
    headers: { "Content-Type": "application/xml" },
  });
}
When a call comes in, Twilio answers and opens a WebSocket to your /voice/stream endpoint. Audio chunks flow in both directions from this point on.
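To make the audio flow concrete, here is a sketch of the JSON messages Twilio Media Streams sends over that WebSocket, reduced to the fields a bridge typically needs. The audio payload is base64-encoded 8 kHz mu-law. `parseTwilioMessage` is a hypothetical helper, not part of Twilio's SDK.

```typescript
// Shapes of the JSON messages Twilio Media Streams sends over the WebSocket.
// Only the fields the bridge needs are modelled here.
type TwilioStreamMessage =
  | { event: "connected" }
  | { event: "start"; streamSid: string; start: { callSid: string } }
  | { event: "media"; streamSid: string; media: { payload: string } }
  | { event: "stop"; streamSid: string };

// Hypothetical helper: parse a raw frame and return null for anything
// malformed or not handled here (for example "mark" or "dtmf" events).
function parseTwilioMessage(raw: string): TwilioStreamMessage | null {
  let data: any;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  switch (data.event) {
    case "connected":
      return { event: "connected" };
    case "start":
      if (typeof data.streamSid !== "string") return null;
      if (typeof data.start?.callSid !== "string") return null;
      return {
        event: "start",
        streamSid: data.streamSid,
        start: { callSid: data.start.callSid },
      };
    case "media":
      if (typeof data.streamSid !== "string") return null;
      if (typeof data.media?.payload !== "string") return null;
      return {
        event: "media",
        streamSid: data.streamSid,
        media: { payload: data.media.payload },
      };
    case "stop":
      if (typeof data.streamSid !== "string") return null;
      return { event: "stop", streamSid: data.streamSid };
    default:
      return null;
  }
}
```

The `streamSid` from the `start` message matters later: every media message you send back to Twilio has to carry it.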
The bridge service
A Cloud Run service exposing a WebSocket endpoint. When Twilio connects, your service connects to OpenAI's Realtime API and pipes audio.
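import { randomUUID } from "node:crypto";
import WebSocket, { WebSocketServer } from "ws";

// Cloud Run injects PORT; fall back to 8080 locally
const wss = new WebSocketServer({ port: Number(process.env.PORT ?? 8080) });

wss.on("connection", (twilioWs) => {
  const callId = randomUUID();
  const log = createCallLogger(callId);
  let streamSid: string | undefined; // set by Twilio's "start" event

  // Open OpenAI realtime session over a plain WebSocket
  const openaiSocket = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    },
  );
  openaiSocket.on("error", (err) => console.error(`[${callId}] OpenAI socket error`, err));

  // Thin wrapper so callers such as handleToolCall can send plain event objects
  const openaiWs = {
    send: (event: object) => openaiSocket.send(JSON.stringify(event)),
  };

  openaiSocket.on("open", () => {
    openaiWs.send({
      type: "session.update",
      session: {
        voice: "alloy",
        instructions: SYSTEM_PROMPT,
        tools: TOOLS,
        // Twilio streams 8 kHz mu-law; match it so no transcoding is needed
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
      },
    });
  });

  // Pipe Twilio audio → OpenAI
  twilioWs.on("message", (raw) => {
    const data = JSON.parse(raw.toString());
    if (data.event === "start") {
      streamSid = data.start.streamSid;
    } else if (data.event === "media" && openaiSocket.readyState === WebSocket.OPEN) {
      openaiWs.send({
        type: "input_audio_buffer.append",
        audio: data.media.payload,
      });
    }
  });

  // Pipe OpenAI audio → Twilio
  openaiSocket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "response.audio.delta" && streamSid) {
      twilioWs.send(JSON.stringify({
        event: "media",
        streamSid, // required on media messages sent back to Twilio
        media: { payload: msg.delta },
      }));
    }
    if (msg.type === "response.function_call_arguments.done") {
      // Tool call from the model
      handleToolCall(msg, openaiWs, log);
    }
  });

  // Cleanup on disconnect
  twilioWs.on("close", async () => {
    openaiSocket.close();
    await persistCall(callId, log);
  });
});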
Real production code adds: structured logging, error recovery on intermittent disconnects, cost tracking, latency measurement at each hop, PII redaction.
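For the disconnect-recovery item, one common approach is capped exponential backoff with jitter before reopening the model socket. The sketch below is an assumption about how you might do it, not a description of any SDK behaviour; `reconnectDelayMs` and its default values are hypothetical, and on a live call the cap has to stay short enough that the caller does not sit in silence.

```typescript
// Hypothetical reconnect policy: exponential backoff with a cap.
// `jitter` is a value in [0, 1), e.g. Math.random(); passing it in keeps
// the function deterministic and easy to test.
function reconnectDelayMs(
  attempt: number,
  jitter: number,
  baseMs = 250,
  capMs = 4000,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // "Full jitter": pick a point between 0 and the exponential ceiling.
  return Math.floor(ceiling * jitter);
}
```

A caller would loop over `attempt = 0, 1, 2, ...`, wait `reconnectDelayMs(attempt, Math.random())`, and fall back to `transferToHuman` after a small number of failed attempts.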
Tool definitions
Tools are the interface between the model's reasoning and your business systems. Each tool is typed and idempotent.
const TOOLS = [
  {
    type: "function",
    name: "findAvailability",
    description: "Find open appointment slots on a given date",
    parameters: {
      type: "object",
      properties: {
        date: { type: "string", description: "ISO date" },
        durationMinutes: { type: "number", default: 30 },
      },
      required: ["date"],
    },
  },
  {
    type: "function",
    name: "bookAppointment",
    description: "Book an appointment for the caller in an open slot",
    parameters: {
      type: "object",
      properties: {
        slotId: { type: "string" },
        customer: {
          type: "object",
          properties: {
            name: { type: "string" },
            phone: { type: "string" },
            email: { type: "string" },
            notes: { type: "string" },
          },
          required: ["name"],
        },
      },
      required: ["slotId", "customer"],
    },
  },
  // ... more tools
];
Implementation:
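async function handleToolCall(msg, openaiWs, log) {
  const { name, arguments: argsRaw, call_id } = msg;
  let result;
  try {
    // Parse inside the try so malformed arguments reach the model as an
    // error result instead of crashing the bridge
    const args = JSON.parse(argsRaw);
    log.toolCall({ name, args });
    const tool = TOOL_IMPLEMENTATIONS[name];
    if (!tool) throw new Error(`Unknown tool: ${name}`);
    result = await tool(args);
  } catch (err) {
    result = { error: String(err) };
    log.toolError({ name, err });
  }
  // Hand the result back to the model, then ask it to continue speaking
  openaiWs.send({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id,
      output: JSON.stringify(result),
    },
  });
  openaiWs.send({ type: "response.create" });
}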
Each tool implementation in TOOL_IMPLEMENTATIONS is an idempotent function with its own logging.
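One way to get that idempotency is to wrap each tool in a key-based cache, so a retried call returns the first result instead of repeating the side effect. The sketch below is illustrative: `withIdempotency` and the key function are hypothetical, and the in-memory `Map` stands in for whatever durable store (Firestore, for instance) a real deployment would use.

```typescript
type ToolFn<A, R> = (args: A) => Promise<R>;

// Hypothetical wrapper: the first call for a key runs the tool; repeats with
// the same key reuse the stored promise, so a retried bookAppointment does
// not create a second calendar event.
function withIdempotency<A, R>(
  tool: ToolFn<A, R>,
  keyOf: (args: A) => string,
): ToolFn<A, R> {
  const inFlightOrDone = new Map<string, Promise<R>>();
  return (args: A) => {
    const key = keyOf(args);
    const existing = inFlightOrDone.get(key);
    if (existing) return existing;
    const started = tool(args).catch((err) => {
      inFlightOrDone.delete(key); // failed attempts may be retried
      throw err;
    });
    inFlightOrDone.set(key, started);
    return started;
  };
}

// Example: key bookings by slot, so one slot is booked at most once.
let bookings = 0;
const bookAppointment = withIdempotency(
  async (args: { slotId: string; customer: { name: string } }) => {
    bookings += 1; // stands in for the real calendar write
    return { confirmationId: `conf-${args.slotId}` };
  },
  (args) => `book:${args.slotId}`,
);
```

Keying on `slotId` is a design choice for this example; keying on the model's `call_id` would instead protect against the same tool call being delivered twice.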
The system prompt
The model's persona, scope, and rules of engagement live in the system prompt. Keep it short, specific, and unambiguous.
You are the AI assistant for [Practice Name]. You answer phone calls
to help callers book, reschedule, cancel, or get directions.
Always:
- Disclose at the start of the call that you are an AI assistant.
- Keep responses brief and conversational, not formal.
- Read back time/date/name before booking an appointment.
- Offer to transfer to a human if the caller asks or if you're unsure.
Never:
- Provide medical advice.
- Discuss billing disputes or insurance disputes — transfer.
- Make jokes about delays, prices, or sensitive topics.
- Pretend you are human.
Tools available:
- findAvailability(date, durationMinutes): query open slots
- bookAppointment(slotId, customer): book the slot
- transferToHuman(reason): warm transfer during business hours,
callback queue outside
- logCallOutcome(outcome, notes): always call this at the end
Business hours: Mon-Fri 9:00-18:00 CET. Outside hours, queue callbacks.
Real prompts are longer, more specific, and tuned based on eval results.
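The business-hours rule in the prompt is better enforced in code than left to the model. A minimal sketch of the check `transferToHuman` might run, under assumptions: the time zone `Europe/Berlin` is a stand-in for "CET", and both function names are hypothetical.

```typescript
// Hypothetical helper for transferToHuman: is `now` inside Mon-Fri
// 09:00-18:00 in the practice's time zone? "Europe/Berlin" is an assumed
// stand-in for CET.
function isBusinessHours(now: Date, timeZone = "Europe/Berlin"): boolean {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    weekday: "short",
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).formatToParts(now);
  const get = (type: string) => parts.find((p) => p.type === type)?.value ?? "";
  const weekday = get("weekday"); // "Mon", "Tue", ...
  const minutes = Number(get("hour")) * 60 + Number(get("minute"));
  const isWeekday = ["Mon", "Tue", "Wed", "Thu", "Fri"].includes(weekday);
  return isWeekday && minutes >= 9 * 60 && minutes < 18 * 60;
}

// Warm transfer during business hours, callback queue outside.
function transferMode(now: Date): "warm_transfer" | "callback_queue" {
  return isBusinessHours(now) ? "warm_transfer" : "callback_queue";
}
```

Public holidays are not handled here; a real deployment would consult the practice's calendar as well.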
Recording and observability
Twilio does not record calls unless you turn recording on, so enable it when the call starts; you choose where the audio lands. We typically upload to the client's Cloud Storage with a 90-day retention policy and generate the transcript immediately.
// callSid, callerNumber, durationMs and derivedOutcome are captured earlier in the call
twilioWs.on("close", async () => {
  const recording = await fetchTwilioRecording(callSid);
  await uploadToGcs(`calls/${callId}/audio.mp3`, recording);
  const transcript = await generateTranscript(recording);
  await firestore.collection("calls").doc(callId).set({
    callerNumber: maskPii(callerNumber),
    durationMs,
    transcript,
    audioGcsPath: `gs://.../calls/${callId}/audio.mp3`,
    outcome: derivedOutcome,
    toolCalls: log.toolCalls,
    cost: log.totalCost,
    createdAt: serverTimestamp(),
  });
  await langfuseClient.trace({...});
});
Plus a dashboard showing per-day call volume, outcome distribution, cost per call, latency p50/p95, and a search/filter for past calls.
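The latency p50/p95 figures on such a dashboard can be computed with a nearest-rank percentile over the per-call samples. This is one standard definition among several, shown as a sketch; the function names are hypothetical.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

function latencySummary(samplesMs: number[]): { p50: number; p95: number } {
  return { p50: percentile(samplesMs, 50), p95: percentile(samplesMs, 95) };
}
```

With few samples the p95 is simply the worst or near-worst call, so it is worth showing the sample count next to it.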
The eval harness
Voice agents drift more than text agents because real conversations vary more than real text inputs. Build the eval harness in week one of the build.
// scripts/eval-voice.ts
const fixtures = await loadCallFixtures(); // 30-50 real anonymised transcripts

for (const fixture of fixtures) {
  const simulated = await simulateCall({
    inputs: fixture.callerTurns,
    systemPrompt: CURRENT_SYSTEM_PROMPT,
    tools: TOOLS,
  });
  const grade = await gradeOutcome(simulated, fixture.expectedOutcome);
  reportGrade(fixture.id, grade);
}
Run on every prompt change, every model upgrade, every tool change. Numbers go on the dashboard; regressions block deploy.
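"Regressions block deploy" needs an explicit rule. A minimal sketch of one, assuming each fixture is graded pass/fail and a baseline pass rate is stored from the last release; `evaluateGate`, the grade shape, and the 2-point tolerance are all hypothetical.

```typescript
interface EvalGrade {
  fixtureId: string;
  passed: boolean;
}

// Hypothetical gate: block the deploy when the pass rate falls more than
// `tolerance` below the stored baseline, and list fixtures that newly fail.
function evaluateGate(
  grades: EvalGrade[],
  baselinePassRate: number,
  previouslyPassing: Set<string>,
  tolerance = 0.02,
): { passRate: number; newFailures: string[]; blockDeploy: boolean } {
  const passed = grades.filter((g) => g.passed).length;
  const passRate = grades.length === 0 ? 0 : passed / grades.length;
  const newFailures = grades
    .filter((g) => !g.passed && previouslyPassing.has(g.fixtureId))
    .map((g) => g.fixtureId);
  return {
    passRate,
    newFailures,
    blockDeploy: passRate < baselinePassRate - tolerance,
  };
}
```

In CI, the script would exit non-zero when `blockDeploy` is true and print `newFailures` so the regression is easy to find.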
Cost economics
| Layer | Cost |
|---|---|
| Twilio Voice (inbound) | ~€0.015 per minute |
| GPT-4o Realtime | ~€0.10 per minute (covers both directions) |
| Tool calls | variable, usually €0.01-€0.05 per call |
| Storage / observability | negligible |
Multiplying the per-minute rates through, a 90-second booking call comes to roughly €0.17 and a 5-minute support call to roughly €0.58, before tool costs. We surface per-call cost on the dashboard so trends are visible.
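For the dashboard, a rough estimator can multiply the table's per-minute rates by call length and add the per-call tool spend. This is a sketch: the `RATES` constant copies the approximate figures from the table, `estimateCallCostEur` is a hypothetical name, and real invoices will differ with billing granularity and actual model usage.

```typescript
// Approximate per-minute rates from the table, in euros.
const RATES = { twilioPerMin: 0.015, realtimePerMin: 0.1 };

// Estimate one call's cost; `toolCostEur` is the per-call tool spend.
function estimateCallCostEur(durationSeconds: number, toolCostEur = 0): number {
  if (durationSeconds < 0) throw new Error("duration must be non-negative");
  const minutes = durationSeconds / 60;
  return minutes * (RATES.twilioPerMin + RATES.realtimePerMin) + toolCostEur;
}

// Format for display, e.g. on the per-call dashboard row.
function formatEur(amount: number): string {
  return `€${amount.toFixed(2)}`;
}
```

Storing the unrounded number and rounding only for display keeps daily and monthly sums accurate.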
What we won't do
- Build a voice agent without recording. Cannot evaluate, cannot tune.
- Build without a warm transfer path to humans.
- Bypass jurisdictional recording-disclosure rules.
- Pretend the AI is human.
Where to go next
For the buyer's perspective, see our Voice AI buyer's guide. For the full case study of a deployment, see Voice Concierge. For engagement and pricing, see our Voice & Phone Agents service page.
If you have a call surface you'd like to automate, drop us a note. One paragraph is enough.