Why GPT-4o Realtime over Whisper + TTS?

GPT-4o Realtime is a single model that listens, reasons, and speaks: sub-2-second responses, proper barge-in, and contextual reasoning in one call. Whisper + TTS is a chained pipeline (speech-to-text → LLM → text-to-speech), which means higher latency, no native barge-in, and more moving parts. Use Realtime for conversational agents and Whisper + TTS for batch transcription jobs.

Can it handle accents?

Yes — GPT-4o Realtime handles a wide variety of accents and dialects well. We test against real call recordings from the client's customer base before deployment. The harder problems are usually background noise and overlapping speech, both of which we tune for during the build.

What about latency?

End-to-end latency (caller speech ends → agent's first word) typically lands at 800 ms–1.5 s in our deployments. That's well below the threshold where calls start feeling unnatural. Running Cloud Run in the right region and handling WebSockets properly gets you under 2 s reliably.
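One way to measure that figure at the bridge is to time the gap between the Realtime API reporting end of caller speech and the first audio chunk of the reply. The sketch below assumes server-side voice activity detection is enabled and uses the beta Realtime event names; `TurnLatencyTracker` is a hypothetical helper, and the number it produces excludes the Twilio network legs on either side.

```typescript
// Hypothetical helper: records "caller stopped speaking → first agent audio"
// per turn, based on Realtime server events seen by the bridge.
class TurnLatencyTracker {
  private speechStoppedAt: number | null = null;
  readonly samplesMs: number[] = [];

  // Call once per server event with its type and a timestamp in milliseconds.
  record(eventType: string, nowMs: number): void {
    if (eventType === "input_audio_buffer.speech_stopped") {
      this.speechStoppedAt = nowMs;
    } else if (
      eventType === "response.audio.delta" &&
      this.speechStoppedAt !== null
    ) {
      // Only the first audio chunk of a turn counts; later chunks are ignored.
      this.samplesMs.push(nowMs - this.speechStoppedAt);
      this.speechStoppedAt = null;
    }
  }
}
```

In the bridge this would be called from the OpenAI message handler with `Date.now()`, and the samples written to the call log.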

How do you handle PII (personal information) in transcripts?

Two layers. (1) Recordings and transcripts are stored in your cloud, not ours, under a retention policy you set. (2) An optional PII redaction layer detects and masks credit card numbers, social security numbers, and dates of birth in stored transcripts. Redaction is configurable per use case and per region.
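As a rough illustration of the redaction layer, the sketch below masks the three categories with regular expressions. It is a minimal pattern-based example, not the production redactor: `redactPii` and `passesLuhn` are hypothetical names, the SSN pattern assumes the US format, and the date pattern masks every numeric date rather than only dates of birth.

```typescript
// Luhn checksum: filters out digit runs that are not plausible card numbers.
function passesLuhn(digits: string): boolean {
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

function redactPii(text: string): string {
  return text
    // 13-19 digits, optionally separated by spaces or hyphens
    .replace(/\b\d(?:[ -]?\d){12,18}\b/g, (match: string) => {
      const digits = match.replace(/[ -]/g, "");
      return passesLuhn(digits) ? "[CARD]" : match;
    })
    // US social security number, e.g. 123-45-6789
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]")
    // Numeric dates such as 04/12/1985 or 1985-12-04
    .replace(
      /\b(?:\d{1,2}[\/.-]\d{1,2}[\/.-]\d{4}|\d{4}-\d{2}-\d{2})\b/g,
      "[DATE]",
    );
}
```

Spoken transcripts often contain numbers written out as words, so a pattern-based pass like this would normally be one layer among several.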

Building a phone agent with Twilio + GPT-4o: a complete walkthrough

May 6, 2026 · updated May 21, 2026 · 6 min read

Architecture overview

The pipeline:

[Caller dials Twilio number]
         ↓
[Twilio answers, opens audio stream via WebSocket to your service]
         ↓
[Node.js bridge on Cloud Run]
   ├── Receives Twilio audio chunks
   ├── Maintains WebSocket to OpenAI GPT-4o Realtime
   ├── Pipes audio in both directions
   ├── Routes function calls to your tools
   └── Logs everything to Firestore + Langfuse
         ↓
[GPT-4o Realtime model]
   ├── Listens to caller audio
   ├── Generates response audio (TTS-equivalent in one model)
   ├── Decides when to call tools
   └── Decides when to end the call
         ↓
[Your tools]
   ├── findAvailability(date, durationMinutes)
   ├── bookAppointment(slotId, customerInfo)
   ├── transferToHuman(reason)
   ├── logLead(name, contact, qualifyingInfo)
   └── logCallOutcome(outcome, notes)
         ↓
[Side effects]
   ├── Calendar event created
   ├── CRM record written
   ├── SMS confirmation sent
   ├── Slack alert if escalation
   └── Recording + transcript stored

Three production-critical details: the WebSocket bridge handles back-pressure correctly; tools are idempotent; every call is recorded.
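To make the back-pressure point concrete, here is a minimal sketch of one possible policy: queue outbound audio frames, send only while the socket's send buffer is below a threshold, and drop the oldest frames when the queue overflows. `BoundedAudioSender`, the thresholds, and the drop-oldest policy are assumptions for illustration; `bufferedAmount` and `send` mirror members that the `ws` library's WebSocket exposes.

```typescript
// Minimal socket shape needed by the sender.
interface SocketLike {
  bufferedAmount: number; // bytes queued but not yet flushed to the network
  send(data: string): void;
}

class BoundedAudioSender {
  private queue: string[] = [];
  droppedFrames = 0;

  constructor(
    private socket: SocketLike,
    private maxBufferedBytes = 64 * 1024, // assumed threshold
    private maxQueuedFrames = 50, // assumed cap on queued frames
  ) {}

  enqueue(frame: string): void {
    this.queue.push(frame);
    // For live audio, stale frames are worse than missing ones: drop oldest.
    while (this.queue.length > this.maxQueuedFrames) {
      this.queue.shift();
      this.droppedFrames++;
    }
    this.flush();
  }

  // Call again on a timer or after the socket has drained.
  flush(): void {
    while (
      this.queue.length > 0 &&
      this.socket.bufferedAmount < this.maxBufferedBytes
    ) {
      this.socket.send(this.queue.shift() as string);
    }
  }
}
```

Tracking `droppedFrames` per call gives an early signal when a deployment is running into network congestion.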

Twilio side

Provision a phone number. Configure it to point at your service via a webhook (POST /voice/incoming).

In app/api/voice/incoming/route.ts:

import twilio from "twilio";

export async function POST(req: Request) {
  const VoiceResponse = twilio.twiml.VoiceResponse;
  const twiml = new VoiceResponse();
  const connect = twiml.connect();
  connect.stream({
    url: `wss://${process.env.HOST}/voice/stream`,
  });
  return new Response(twiml.toString(), {
    headers: { "Content-Type": "application/xml" },
  });
}

When a call comes in, Twilio answers and opens a WebSocket to your /voice/stream endpoint. Audio chunks flow in both directions from this point on.
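The messages Twilio sends over that WebSocket are JSON objects with an `event` field. A small typed parser keeps the bridge from acting on malformed input; the sketch below covers only the events used in this walkthrough, and `parseTwilioMessage` with its flattened return shape is a hypothetical helper rather than part of the Twilio SDK.

```typescript
// Subset of the Twilio Media Streams messages the bridge cares about,
// normalised into a flat shape.
type TwilioStreamMessage =
  | { event: "connected" }
  | { event: "start"; streamSid: string; callSid: string }
  | { event: "media"; payload: string } // base64-encoded audio chunk
  | { event: "stop" };

function parseTwilioMessage(raw: string): TwilioStreamMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const msg = data as Record<string, any>;

  switch (msg.event) {
    case "connected":
      return { event: "connected" };
    case "start": {
      const streamSid = msg.start?.streamSid;
      const callSid = msg.start?.callSid;
      return typeof streamSid === "string" && typeof callSid === "string"
        ? { event: "start", streamSid, callSid }
        : null;
    }
    case "media": {
      const payload = msg.media?.payload;
      return typeof payload === "string" ? { event: "media", payload } : null;
    }
    case "stop":
      return { event: "stop" };
    default:
      return null; // other events (e.g. "mark", "dtmf") are ignored here
  }
}
```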

The bridge service

A Cloud Run service exposing a WebSocket endpoint. When Twilio connects, your service connects to OpenAI's Realtime API and pipes audio.

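import { randomUUID } from "node:crypto";
import WebSocket, { WebSocketServer } from "ws";

// SYSTEM_PROMPT, TOOLS, createCallLogger, handleToolCall and persistCall
// are defined elsewhere in the service.
const REALTIME_URL =
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (twilioWs) => {
  const callId = randomUUID();
  const log = createCallLogger(callId);
  let streamSid = null; // Twilio sends this in its "start" event

  // Open the OpenAI Realtime session over a raw WebSocket (beta protocol)
  const realtimeSocket = new WebSocket(REALTIME_URL, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  });
  realtimeSocket.on("error", (err) => console.error("realtime socket", err));

  // Thin wrapper so the rest of the code can send plain event objects;
  // events are dropped until the socket is open
  const openaiWs = {
    send(event) {
      if (realtimeSocket.readyState === WebSocket.OPEN) {
        realtimeSocket.send(JSON.stringify(event));
      }
    },
  };

  // Configure the session once connected. Twilio streams 8 kHz mu-law,
  // so both directions use g711_ulaw and no transcoding is needed
  realtimeSocket.on("open", () => {
    openaiWs.send({
      type: "session.update",
      session: {
        voice: "alloy",
        instructions: SYSTEM_PROMPT,
        tools: TOOLS,
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
      },
    });
  });

  // Pipe Twilio audio → OpenAI
  twilioWs.on("message", (msg) => {
    const data = JSON.parse(msg.toString());
    if (data.event === "start") {
      streamSid = data.start.streamSid;
    } else if (data.event === "media") {
      openaiWs.send({
        type: "input_audio_buffer.append",
        audio: data.media.payload,
      });
    }
  });

  // Pipe OpenAI audio → Twilio
  realtimeSocket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "response.audio.delta" && streamSid) {
      twilioWs.send(JSON.stringify({
        event: "media",
        streamSid, // Twilio requires the stream ID on outbound media
        media: { payload: msg.delta },
      }));
    }
    if (msg.type === "response.function_call_arguments.done") {
      // Tool call from the model
      handleToolCall(msg, openaiWs, log);
    }
  });

  // Cleanup on disconnect
  twilioWs.on("close", async () => {
    realtimeSocket.close();
    await persistCall(callId, log);
  });
});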

Real production code adds: structured logging, error recovery on intermittent disconnects, cost tracking, latency measurement at each hop, PII redaction.
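Barge-in is one piece worth sketching separately. When the caller starts talking over the agent, audio already buffered on the Twilio side keeps playing unless the bridge asks Twilio to discard it. The function below is a minimal sketch of that mapping, assuming the beta Realtime event names and server-side voice activity detection; `toTwilioMessages` is a hypothetical helper that returns the messages to send rather than sending them.

```typescript
interface RealtimeServerEvent {
  type: string;
  delta?: string; // base64 audio, present on response.audio.delta
}

type TwilioOutbound =
  | { event: "media"; streamSid: string; media: { payload: string } }
  | { event: "clear"; streamSid: string };

// Translates one Realtime server event into zero or more Twilio messages.
function toTwilioMessages(
  evt: RealtimeServerEvent,
  streamSid: string,
): TwilioOutbound[] {
  if (evt.type === "response.audio.delta" && typeof evt.delta === "string") {
    return [{ event: "media", streamSid, media: { payload: evt.delta } }];
  }
  if (evt.type === "input_audio_buffer.speech_started") {
    // Caller started talking: ask Twilio to drop audio it has buffered
    // so the agent stops mid-sentence instead of talking over the caller.
    return [{ event: "clear", streamSid }];
  }
  return [];
}
```

Each returned message would be passed through `JSON.stringify` and sent on the Twilio socket. A fuller version would likely also cancel or truncate the in-flight response on the OpenAI side, so the model's view of what was said matches what the caller actually heard.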

Tool definitions

Tools are the interface between the model's reasoning and your business systems. Each tool is typed and idempotent.

const TOOLS = [
  {
    type: "function",
    name: "findAvailability",
    description: "Find open appointment slots on a given date",
    parameters: {
      type: "object",
      properties: {
        date: { type: "string", description: "ISO date" },
        durationMinutes: { type: "number", default: 30 },
      },
      required: ["date"],
    },
  },
  {
    type: "function",
    name: "bookAppointment",
    description: "Book an appointment for the caller in an open slot",
    parameters: {
      type: "object",
      properties: {
        slotId: { type: "string" },
        customer: {
          type: "object",
          properties: {
            name: { type: "string" },
            phone: { type: "string" },
            email: { type: "string" },
            notes: { type: "string" },
          },
          required: ["name"],
        },
      },
      required: ["slotId", "customer"],
    },
  },
  // ... more tools
];

Implementation:

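async function handleToolCall(msg, openaiWs, log) {
  const { name, arguments: argsRaw, call_id } = msg;

  let result;
  try {
    // Parse inside the try block so malformed arguments reach the model
    // as an error result instead of crashing the bridge
    const args = JSON.parse(argsRaw);
    log.toolCall({ name, args });

    const implementation = TOOL_IMPLEMENTATIONS[name];
    if (!implementation) throw new Error(`Unknown tool: ${name}`);
    result = await implementation(args);
  } catch (err) {
    result = { error: String(err) };
    log.toolError({ name, err });
  }

  // Return the tool output to the model...
  openaiWs.send({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id,
      output: JSON.stringify(result),
    },
  });
  // ...then ask it to continue the conversation with that result
  openaiWs.send({ type: "response.create" });
}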

Each tool implementation in TOOL_IMPLEMENTATIONS is an idempotent function with its own logging.
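One way to get idempotency is to wrap each side-effecting tool so that repeated invocations with the same key reuse the first result. The sketch below keeps keys in memory, which is only enough to illustrate the idea; `idempotent` and `keyOf` are hypothetical names, and a real deployment would more plausibly persist the key in a durable store so retries survive a restart.

```typescript
type ToolFn<A, R> = (args: A) => Promise<R>;

// Wraps a tool so calls sharing a key run the underlying side effect once.
function idempotent<A, R>(
  fn: ToolFn<A, R>,
  keyOf: (args: A) => string,
): ToolFn<A, R> {
  const results = new Map<string, Promise<R>>();
  return (args: A) => {
    const key = keyOf(args);
    const existing = results.get(key);
    if (existing) return existing; // duplicate call: reuse the first result

    const pending = fn(args).catch((err) => {
      results.delete(key); // failed attempts may be retried
      throw err;
    });
    results.set(key, pending);
    return pending;
  };
}
```

For `bookAppointment`, a plausible key is the slot ID combined with the caller's phone number, so a model that repeats the tool call after a hiccup does not create a second booking.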

The system prompt

The model's persona, scope, and rules of engagement live in the system prompt. Keep it short, specific, and unambiguous.

You are the AI assistant for [Practice Name]. You answer phone calls
to help callers book, reschedule, cancel, or get directions.

Always:
- Disclose at the start of the call that you are an AI assistant.
- Keep responses brief and conversational, not formal.
- Read back time/date/name before booking an appointment.
- Offer to transfer to a human if the caller asks or if you're unsure.

Never:
- Provide medical advice.
- Discuss billing disputes or insurance disputes — transfer.
- Make jokes about delays, prices, or sensitive topics.
- Pretend you are human.

Tools available:
- findAvailability(date, durationMinutes): query open slots
- bookAppointment(slotId, customer): book the slot
- transferToHuman(reason): warm transfer during business hours,
  callback queue outside
- logCallOutcome(outcome, notes): always call this at the end

Business hours: Mon-Fri 9:00-18:00 CET. Outside hours, queue callbacks.

Real prompts are longer, more specific, and tuned based on eval results.
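Rules like the business-hours one can also be enforced in the tool itself, so the routing does not depend on the model following the prompt. Below is a minimal sketch for `transferToHuman`, assuming the hours from the example prompt and the `Europe/Berlin` zone as a stand-in for "CET"; `routeTransfer` and the route names are hypothetical.

```typescript
type TransferRoute = "warm_transfer" | "callback_queue";

// Mon-Fri 09:00-18:00 local time → warm transfer, otherwise callback queue.
function routeTransfer(
  now: Date,
  timeZone = "Europe/Berlin", // assumed zone for the prompt's "CET"
): TransferRoute {
  // en-GB formats hours on a 24-hour clock and weekdays as "Mon", "Tue", ...
  const parts = new Intl.DateTimeFormat("en-GB", {
    timeZone,
    weekday: "short",
    hour: "2-digit",
  }).formatToParts(now);

  const weekday = parts.find((p) => p.type === "weekday")?.value ?? "";
  const hour = Number(parts.find((p) => p.type === "hour")?.value ?? "-1");

  const isWeekday = ["Mon", "Tue", "Wed", "Thu", "Fri"].includes(weekday);
  const isOpen = hour >= 9 && hour < 18;
  return isWeekday && isOpen ? "warm_transfer" : "callback_queue";
}
```

Using a named time zone rather than a fixed offset means the check follows daylight saving changes without code edits.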

Recording and observability

Twilio does not record calls by default: you enable recording and choose where the audio lands. We typically upload to the client's Cloud Storage with a 90-day retention policy and immediate transcript generation.

twilioWs.on("close", async () => {
  const recording = await fetchTwilioRecording(callSid);
  await uploadToGcs(`calls/${callId}/audio.mp3`, recording);
  const transcript = await generateTranscript(recording);
  await firestore.collection("calls").doc(callId).set({
    callerNumber: maskPii(callerNumber),
    durationMs,
    transcript,
    audioGcsPath: `gs://.../calls/${callId}/audio.mp3`,
    outcome: derivedOutcome,
    toolCalls: log.toolCalls,
    cost: log.totalCost,
    createdAt: serverTimestamp(),
  });
  await langfuseClient.trace({...});
});

Plus a dashboard showing per-day call volume, outcome distribution, cost per call, latency p50/p95, and a search/filter for past calls.
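The p50/p95 figures can be computed from raw per-turn samples with a nearest-rank percentile. This is a minimal sketch of that calculation; `percentile` is a hypothetical helper, and a production dashboard would more likely compute it in the query layer.

```typescript
// Nearest-rank percentile over a list of samples (e.g. latency in ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Rank is 1-based: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Convenience wrapper for the two figures shown on the dashboard.
function latencySummary(samplesMs: number[]): { p50: number; p95: number } {
  return { p50: percentile(samplesMs, 50), p95: percentile(samplesMs, 95) };
}
```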

The eval harness

Voice agents drift more than text agents because real conversations vary more than real text inputs. Build the eval harness in week one of the build.

// scripts/eval-voice.ts
const fixtures = await loadCallFixtures(); // 30-50 real anonymised transcripts

for (const fixture of fixtures) {
  const simulated = await simulateCall({
    inputs: fixture.callerTurns,
    systemPrompt: CURRENT_SYSTEM_PROMPT,
    tools: TOOLS,
  });
  const grade = await gradeOutcome(simulated, fixture.expectedOutcome);
  reportGrade(fixture.id, grade);
}

Run on every prompt change, every model upgrade, every tool change. Numbers go on the dashboard; regressions block deploy.
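"Regressions block deploy" can be as simple as comparing the current pass rate against a stored baseline. The sketch below shows one such gate; `evalGate`, the `EvalGrade` shape, and the two-point tolerance are assumptions, not the harness's actual interface.

```typescript
interface EvalGrade {
  fixtureId: string;
  passed: boolean;
}

interface GateResult {
  passRate: number;
  failures: string[]; // fixture IDs to inspect
  blockDeploy: boolean;
}

// Blocks deploy when the pass rate falls more than `tolerance` below baseline.
function evalGate(
  grades: EvalGrade[],
  baselinePassRate: number,
  tolerance = 0.02, // assumed allowance for run-to-run noise
): GateResult {
  const failures = grades.filter((g) => !g.passed).map((g) => g.fixtureId);
  const passRate =
    grades.length === 0 ? 0 : (grades.length - failures.length) / grades.length;
  return {
    passRate,
    failures,
    blockDeploy: passRate < baselinePassRate - tolerance,
  };
}
```

In CI, the eval script would exit with a non-zero status when `blockDeploy` is true and print `failures` so the regressed fixtures are easy to find.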

Cost economics

Layer	Cost
Twilio Voice (inbound)	~€0.015 per minute
GPT-4o Realtime	~€0.10 per minute (covers both directions)
Tool calls	variable, usually €0.01-€0.05 per call
Storage / observability	negligible

At these rates, a 90-second booking call comes to roughly €0.18-€0.22 including tool calls, and a 5-minute support call to roughly €0.59-€0.63. We surface per-call cost on the dashboard so trends are visible.
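A rough per-call estimate can be derived from the table by multiplying call length by the per-minute rates and adding a tool-call allowance. The sketch below does exactly that and nothing more: it assumes both per-minute rates apply linearly to the full call duration, which is a simplification, so the measured per-call cost on the dashboard remains the figure to trust. `estimateCallCostEur` is a hypothetical helper.

```typescript
// Per-minute rates taken from the table above (approximate, in EUR).
const RATES_EUR_PER_MIN = {
  twilioInbound: 0.015,
  realtime: 0.1,
};

// Estimated cost of one call, rounded to four decimal places.
function estimateCallCostEur(
  durationSeconds: number,
  toolCostEur = 0, // per-call tool spend, typically 0.01-0.05
): number {
  const minutes = durationSeconds / 60;
  const perMinute =
    RATES_EUR_PER_MIN.twilioInbound + RATES_EUR_PER_MIN.realtime;
  const total = minutes * perMinute + toolCostEur;
  return Math.round(total * 10_000) / 10_000;
}
```

Comparing this estimate with the logged cost per call is a quick way to spot calls where tool usage or long silences push spend above what duration alone would predict.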

What we won't do

Build a voice agent without recording. Cannot evaluate, cannot tune.
Build without a warm transfer path to humans.
Bypass jurisdictional recording-disclosure rules.
Pretend the AI is human.

Where to go next

For the buyer's perspective, see our Voice AI buyer's guide. For the full case study of a deployment, see Voice Concierge. For engagement and pricing, see our Voice & Phone Agents service page.

If you have a call surface you'd like to automate, drop us a note. One paragraph is enough.
