Voice AI

Building a phone agent with Twilio + GPT-4o: a complete walkthrough

Updated May 21, 2026 · 6 min read

Architecture overview

The pipeline:

[Caller dials Twilio number]
         ↓
[Twilio answers, opens audio stream via WebSocket to your service]
         ↓
[Node.js bridge on Cloud Run]
   ├── Receives Twilio audio chunks
   ├── Maintains WebSocket to OpenAI GPT-4o Realtime
   ├── Pipes audio in both directions
   ├── Routes function calls to your tools
   └── Logs everything to Firestore + Langfuse
         ↓
[GPT-4o Realtime model]
   ├── Listens to caller audio
   ├── Generates response audio (TTS-equivalent in one model)
   ├── Decides when to call tools
   └── Decides when to end the call
         ↓
[Your tools]
   ├── findAvailability(date, durationMinutes)
   ├── bookAppointment(slotId, customerInfo)
   ├── transferToHuman(reason)
   ├── logLead(name, contact, qualifyingInfo)
   └── logCallOutcome(outcome, notes)
         ↓
[Side effects]
   ├── Calendar event created
   ├── CRM record written
   ├── SMS confirmation sent
   ├── Slack alert if escalation
   └── Recording + transcript stored

Three details are production-critical: the WebSocket bridge must handle back-pressure correctly, every tool must be idempotent, and every call must be recorded.
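To make the back-pressure point concrete, here is a minimal sketch of a bounded audio queue that discards the oldest frames when the downstream socket cannot keep up. The class name `BoundedAudioQueue` and the idea of capping by frame count are assumptions for illustration, not part of the pipeline above; a real bridge would probably also check how many bytes the socket has buffered before writing.

```typescript
// Hypothetical helper: a fixed-capacity FIFO for base64-encoded audio frames.
// When the queue is full, the oldest frame is discarded so latency stays bounded.
class BoundedAudioQueue {
  private readonly capacity: number;
  private frames: string[] = [];
  private dropped = 0;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError("capacity must be >= 1");
    this.capacity = capacity;
  }

  // Add a frame, evicting the oldest one if the queue is already full.
  push(frame: string): void {
    if (this.frames.length >= this.capacity) {
      this.frames.shift();
      this.dropped += 1;
    }
    this.frames.push(frame);
  }

  // Remove and return the oldest frame, or undefined when empty.
  shift(): string | undefined {
    return this.frames.shift();
  }

  get size(): number {
    return this.frames.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}
```

Dropping the oldest audio instead of buffering without limit keeps the conversation close to real time: on a phone call a stale frame is worth less than a fresh one. The dropped-frame counter is worth logging, since a rising count suggests the bridge is falling behind.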

Twilio side

Provision a phone number. Configure it to point at your service via a webhook (POST /voice/incoming).

In app/api/voice/incoming/route.ts:

import twilio from "twilio";

export async function POST(req: Request) {
  const VoiceResponse = twilio.twiml.VoiceResponse;
  const twiml = new VoiceResponse();
  const connect = twiml.connect();
  connect.stream({
    url: `wss://${process.env.HOST}/voice/stream`,
  });
  return new Response(twiml.toString(), {
    headers: { "Content-Type": "application/xml" },
  });
}

When a call comes in, Twilio answers and opens a WebSocket to your /voice/stream endpoint. Audio chunks flow in both directions from this point on.
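The messages Twilio sends over the stream are JSON text frames with an `event` field. The sketch below parses them into a small typed union. It covers only the `connected`, `start`, `media`, and `stop` events and only the fields this walkthrough uses; the type and function names are illustrative, not something Twilio's SDK provides.

```typescript
// Hypothetical typed view of the Twilio Media Streams frames used in this article.
type TwilioStreamEvent =
  | { event: "connected" }
  | { event: "start"; streamSid: string; callSid: string }
  | { event: "media"; streamSid: string; payload: string }
  | { event: "stop"; streamSid: string };

// Returns null for malformed JSON, unknown events, or frames missing expected fields.
function parseTwilioFrame(raw: string): TwilioStreamEvent | null {
  let data: any;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }

  switch (data?.event) {
    case "connected":
      return { event: "connected" };
    case "start": {
      const streamSid = data.start?.streamSid;
      const callSid = data.start?.callSid;
      if (typeof streamSid !== "string" || typeof callSid !== "string") return null;
      return { event: "start", streamSid, callSid };
    }
    case "media": {
      const payload = data.media?.payload; // base64-encoded audio chunk
      if (typeof data.streamSid !== "string" || typeof payload !== "string") return null;
      return { event: "media", streamSid: data.streamSid, payload };
    }
    case "stop":
      if (typeof data.streamSid !== "string") return null;
      return { event: "stop", streamSid: data.streamSid };
    default:
      return null;
  }
}
```

The `streamSid` from the `start` event matters later: any audio sent back to Twilio has to carry it, and the `callSid` is what ties the stream to the call's recording.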

The bridge service

The bridge is a Cloud Run service that exposes a WebSocket endpoint. When Twilio connects, the service opens its own connection to OpenAI's Realtime API and pipes audio between the two.

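import { randomUUID } from "node:crypto";
import WebSocket, { WebSocketServer } from "ws";

// Cloud Run injects PORT; fall back to 8080 for local runs.
const wss = new WebSocketServer({ port: Number(process.env.PORT) || 8080 });

// This version talks to the Realtime API over a raw WebSocket (beta endpoint
// and header) instead of going through an SDK helper.
const REALTIME_URL =
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";

wss.on("connection", (twilioWs) => {
  const callId = randomUUID();
  const log = createCallLogger(callId);
  let streamSid: string | undefined; // set by Twilio's "start" event

  // Open OpenAI realtime session
  const openaiWs = new WebSocket(REALTIME_URL, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  });

  // Thin wrapper so callers can send event objects; frames that arrive
  // before the OpenAI socket is open are dropped rather than throwing.
  const realtime = {
    send: (event: object) => {
      if (openaiWs.readyState === WebSocket.OPEN) {
        openaiWs.send(JSON.stringify(event));
      }
    },
  };

  // Configure the session once connected. Twilio streams 8 kHz μ-law audio,
  // so both directions use g711_ulaw and no transcoding is needed.
  openaiWs.on("open", () => {
    realtime.send({
      type: "session.update",
      session: {
        voice: "alloy",
        instructions: SYSTEM_PROMPT,
        tools: TOOLS,
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
      },
    });
  });

  // Pipe Twilio audio → OpenAI
  twilioWs.on("message", (raw) => {
    const data = JSON.parse(raw.toString());
    if (data.event === "start") streamSid = data.start.streamSid;
    if (data.event === "media") {
      realtime.send({
        type: "input_audio_buffer.append",
        audio: data.media.payload,
      });
    }
  });

  // Pipe OpenAI audio → Twilio
  openaiWs.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "response.audio.delta" && streamSid) {
      // Twilio only plays media frames that carry the stream's SID
      twilioWs.send(JSON.stringify({
        event: "media",
        streamSid,
        media: { payload: msg.delta },
      }));
    }
    if (msg.type === "response.function_call_arguments.done") {
      // Tool call from the model
      handleToolCall(msg, realtime, log);
    }
  });

  // Without a listener, an "error" event on the socket would crash the process
  openaiWs.on("error", (err) => console.error(`[${callId}] OpenAI socket error`, err));

  // Cleanup on disconnect
  twilioWs.on("close", async () => {
    openaiWs.close();
    await persistCall(callId, log);
  });
});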

Real production code adds: structured logging, error recovery on intermittent disconnects, cost tracking, latency measurement at each hop, PII redaction.

Tool definitions

Tools are the interface between the model's reasoning and your business systems. Each tool is typed and idempotent.

const TOOLS = [
  {
    type: "function",
    name: "findAvailability",
    description: "Find open appointment slots on a given date",
    parameters: {
      type: "object",
      properties: {
        date: { type: "string", description: "ISO date" },
        durationMinutes: { type: "number", default: 30 },
      },
      required: ["date"],
    },
  },
  {
    type: "function",
    name: "bookAppointment",
    description: "Book an appointment for the caller in an open slot",
    parameters: {
      type: "object",
      properties: {
        slotId: { type: "string" },
        customer: {
          type: "object",
          properties: {
            name: { type: "string" },
            phone: { type: "string" },
            email: { type: "string" },
            notes: { type: "string" },
          },
          required: ["name"],
        },
      },
      required: ["slotId", "customer"],
    },
  },
  // ... more tools
];

Implementation:

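async function handleToolCall(msg, openaiWs, log) {
  const { name, arguments: argsRaw, call_id } = msg;

  let result;
  try {
    // Parse inside the try block: the model can emit malformed JSON arguments
    const args = JSON.parse(argsRaw);
    log.toolCall({ name, args });

    const impl = TOOL_IMPLEMENTATIONS[name];
    if (!impl) throw new Error(`Unknown tool: ${name}`);
    result = await impl(args);
  } catch (err) {
    // Hand the error back to the model so it can recover or offer a transfer
    result = { error: String(err) };
    log.toolError({ name, err });
  }

  // Attach the tool output to the conversation...
  openaiWs.send({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id,
      output: JSON.stringify(result),
    },
  });
  // ...then ask the model to continue speaking with that result in context
  openaiWs.send({ type: "response.create" });
}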

Each tool implementation in TOOL_IMPLEMENTATIONS is an idempotent function with its own logging.
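One way to get that idempotency is to derive a key from each call's arguments and reuse the first result for any repeat. The sketch below does this with an in-memory `Map`; the wrapper name `withIdempotency` and the key function are assumptions for illustration, and a production version would presumably keep keys in a durable store so that retries survive a restart.

```typescript
type ToolFn<A, R> = (args: A) => Promise<R>;

// Hypothetical wrapper: calls with the same key share one execution and one result.
function withIdempotency<A, R>(
  fn: ToolFn<A, R>,
  keyOf: (args: A) => string,
): ToolFn<A, R> {
  const results = new Map<string, Promise<R>>();

  return (args: A) => {
    const key = keyOf(args);
    const existing = results.get(key);
    if (existing) return existing; // repeat call: reuse the first outcome

    const pending = fn(args).catch((err) => {
      results.delete(key); // a failed attempt may be retried
      throw err;
    });
    results.set(key, pending);
    return pending;
  };
}

// Usage sketch: a stand-in booking function keyed on the slot and caller name.
let bookingsCreated = 0;
const bookOnce = withIdempotency(
  async (args: { slotId: string; name: string }) => {
    bookingsCreated += 1;
    return { bookingId: `booking-${args.slotId}` };
  },
  (args) => `${args.slotId}:${args.name}`,
);
```

Keying on the slot plus the caller means that if the model emits the same booking twice (for example after a retry), the second call returns the existing booking instead of creating a duplicate calendar event.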

The system prompt

The model's persona, scope, and rules of engagement live in the system prompt. Keep it short, specific, and unambiguous.

You are the AI assistant for [Practice Name]. You answer phone calls
to help callers book, reschedule, cancel, or get directions.

Always:
- Disclose at the start of the call that you are an AI assistant.
- Keep responses brief and conversational, not formal.
- Read back time/date/name before booking an appointment.
- Offer to transfer to a human if the caller asks or if you're unsure.

Never:
- Provide medical advice.
- Discuss billing disputes or insurance disputes — transfer.
- Make jokes about delays, prices, or sensitive topics.
- Pretend you are human.

Tools available:
- findAvailability(date, durationMinutes): query open slots
- bookAppointment(slotId, customer): book the slot
- transferToHuman(reason): warm transfer during business hours,
  callback queue outside
- logCallOutcome(outcome, notes): always call this at the end

Business hours: Mon-Fri 9:00-18:00 CET. Outside hours, queue callbacks.

Real prompts are longer, more specific, and tuned based on eval results.
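The business-hours rule is easier to trust if the `transferToHuman` tool enforces it in code rather than relying on the model to reason about the clock. A minimal sketch follows. It assumes "CET" means local time in the `Europe/Berlin` zone (including daylight saving), which the prompt does not actually specify, and the function name is hypothetical.

```typescript
// Returns true on Mon-Fri from 09:00 up to (not including) 18:00 local time.
// Assumption: the practice's local time zone is Europe/Berlin.
function isWithinBusinessHours(at: Date, timeZone = "Europe/Berlin"): boolean {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    weekday: "short",
    hour: "numeric",
    hour12: false,
  }).formatToParts(at);

  const weekday = parts.find((p) => p.type === "weekday")?.value ?? "";
  const hour = Number(parts.find((p) => p.type === "hour")?.value);

  const isWeekday = ["Mon", "Tue", "Wed", "Thu", "Fri"].includes(weekday);
  return isWeekday && hour >= 9 && hour < 18;
}
```

Inside the tool, the result would decide between a warm transfer and the callback queue, so the caller gets the right behaviour even if the model misjudges the time.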

Recording and observability

Twilio does not record a call unless recording is enabled for it, so turn recording on explicitly and choose where the audio lands. We typically upload to the client's Cloud Storage with a 90-day retention policy and generate a transcript immediately.

twilioWs.on("close", async () => {
  const recording = await fetchTwilioRecording(callSid);
  await uploadToGcs(`calls/${callId}/audio.mp3`, recording);
  const transcript = await generateTranscript(recording);
  await firestore.collection("calls").doc(callId).set({
    callerNumber: maskPii(callerNumber),
    durationMs,
    transcript,
    audioGcsPath: `gs://.../calls/${callId}/audio.mp3`,
    outcome: derivedOutcome,
    toolCalls: log.toolCalls,
    cost: log.totalCost,
    createdAt: serverTimestamp(),
  });
  await langfuseClient.trace({...});
});

Plus a dashboard showing per-day call volume, outcome distribution, cost per call, latency p50/p95, and a search/filter for past calls.
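For the latency figures, a nearest-rank percentile over the recorded samples is usually enough. The sketch below is one simple way to compute p50 and p95; the function name is an assumption, and a dashboard backed by a metrics store would more likely use that store's own percentile aggregation.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");

  const sorted = [...samples].sort((a, b) => a - b); // copy, then numeric sort
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Usage sketch with made-up latencies in milliseconds.
const latenciesMs = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
```

Tracking p95 alongside p50 matters for voice in particular: a median that looks fine can hide the slow tail of turns where the caller hears an awkward pause.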

The eval harness

Voice agents drift more than text agents because real conversations vary more than real text inputs. Build the eval harness in week one of the build.

// scripts/eval-voice.ts
const fixtures = await loadCallFixtures(); // 30-50 real anonymised transcripts

for (const fixture of fixtures) {
  const simulated = await simulateCall({
    inputs: fixture.callerTurns,
    systemPrompt: CURRENT_SYSTEM_PROMPT,
    tools: TOOLS,
  });
  const grade = await gradeOutcome(simulated, fixture.expectedOutcome);
  reportGrade(fixture.id, grade);
}

Run the harness on every prompt change, every model upgrade, and every tool change. The numbers go on the dashboard, and regressions block deploy.
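The "regressions block deploy" rule can be expressed as a small gate over the grades. The sketch below is one possible shape: the `EvalGrade` type, the baseline pass rate, and the two-point tolerance are all assumptions for illustration, not the grading scheme used above.

```typescript
interface EvalGrade {
  fixtureId: string;
  passed: boolean;
}

// Hypothetical gate: allow deploy only if the pass rate has not fallen more
// than `tolerance` below the stored baseline.
function evalGate(grades: EvalGrade[], baselinePassRate: number, tolerance = 0.02) {
  if (grades.length === 0) throw new RangeError("no grades to evaluate");

  const failures = grades.filter((g) => !g.passed).map((g) => g.fixtureId);
  const passRate = (grades.length - failures.length) / grades.length;

  return {
    passRate,
    failures, // fixture IDs to inspect when the gate fails
    deployAllowed: passRate >= baselinePassRate - tolerance,
  };
}
```

In CI, a script would exit with a non-zero status when `deployAllowed` is false and print the failing fixture IDs, so the person who changed the prompt can see which conversations regressed.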

Cost economics

Layer | Cost
--- | ---
Twilio Voice (inbound) | ~€0.015 per minute
GPT-4o Realtime | ~€0.10 per minute (covers both directions)
Tool calls | variable, usually €0.01-€0.05 per call
Storage / observability | negligible

At those per-minute rates, a 90-second booking call works out to roughly €0.17 before tool calls, and a 5-minute support call to roughly €0.58. We surface per-call cost on the dashboard so trends are visible.
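The arithmetic behind a per-call estimate can be sketched as below, using the approximate per-minute rates from the table. The function name and the flat `toolCostEur` parameter are assumptions for illustration. Real invoices will differ (carriers typically bill in whole-minute increments, for instance), so treat the output as a rough estimate rather than a reproduction of measured figures.

```typescript
// Approximate rates taken from the table above, in euros per minute.
const RATES_EUR = {
  twilioPerMinute: 0.015,
  realtimePerMinute: 0.1,
};

// Rough per-call estimate: duration times the combined per-minute rate,
// plus whatever the tool calls cost. Rounded to a tenth of a cent.
function estimateCallCostEur(durationSeconds: number, toolCostEur = 0): number {
  if (durationSeconds < 0) throw new RangeError("duration must be non-negative");

  const minutes = durationSeconds / 60;
  const perMinute = RATES_EUR.twilioPerMinute + RATES_EUR.realtimePerMinute;
  const total = minutes * perMinute + toolCostEur;
  return Math.round(total * 1000) / 1000;
}

// Usage sketch: a 90-second call with an assumed €0.02 of tool calls.
const bookingCallEstimate = estimateCallCostEur(90, 0.02);
```

Logging this estimate next to the billed cost for each call makes it easy to spot when the two drift apart, which is usually the first sign of a pricing change or an unexpectedly chatty prompt.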

What we won't do

  • Build a voice agent without recording. Cannot evaluate, cannot tune.
  • Build without a warm transfer path to humans.
  • Bypass jurisdictional recording-disclosure rules.
  • Pretend the AI is human.

Where to go next

For the buyer's perspective, see our Voice AI buyer's guide. For the full case study of a deployment, see Voice Concierge. For engagement and pricing, see our Voice & Phone Agents service page.

If you have a call surface you'd like to automate, drop us a note. One paragraph is enough.
