The Latency Number That Actually Matters
When developers first build a voice agent, they measure latency wrong. They time the LLM call. Maybe they time the speech synthesis call too. Then they ship something that "feels fast in testing" and watch users hate it on a real phone line.
The number that matters is end-of-speech to start-of-speech. Caller stops talking. How long until they hear the first audible word back? Not the first token. Not the first audio packet sitting in a buffer. The first word their ear can actually parse.
Humans tolerate roughly 200ms of conversational gap before the silence starts to feel weird. Beyond about 800ms, the caller will start talking again, thinking the line dropped or the agent didn't hear them. The ITU-T G.114 recommendation on one-way transmission time pegs 150ms as the upper bound for "most user applications" before quality degrades, and that's just the network path. You have to fit everything else inside whatever's left.
The original IBM study on response time (Doherty and Thadani, 1982) put the productive threshold for human-computer interaction at 400ms. Voice is harder because there's no visual feedback, no spinner, no "thinking..." indicator. Silence on a phone line is loud.
So your real latency budget for the entire pipeline is roughly 600 to 900ms if you want it to feel natural. Most teams blow past that on their first build and don't understand why.
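Before you optimize anything, instrument that one number. Here's a minimal sketch; the callback names are hypothetical hooks for wherever your pipeline decides the caller stopped speaking and wherever it ships the first outbound audio frame:

```python
import time

class TurnLatencyMeter:
    """Measures end-of-speech -> first audible word, per conversational turn."""

    def __init__(self):
        self._end_of_speech_at = None

    def on_end_of_speech(self):
        # Call this when your endpointing logic decides the caller is done.
        self._end_of_speech_at = time.monotonic()

    def on_first_audio_sent(self):
        # Call this when the FIRST outbound audio frame leaves for the carrier,
        # not when the first LLM token or first TTS byte arrives.
        if self._end_of_speech_at is None:
            return None
        latency_ms = (time.monotonic() - self._end_of_speech_at) * 1000
        self._end_of_speech_at = None
        return latency_ms


if __name__ == "__main__":
    meter = TurnLatencyMeter()
    meter.on_end_of_speech()
    time.sleep(0.85)  # stand-in for a full pipeline turn
    print(f"turn latency: {meter.on_first_audio_sent():.0f} ms")
```

Log it per turn in production. Averages hide the tail latency that callers actually notice.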
The Typical Voice Agent Stack
Here's what almost every team builds first. It's the architecture you'll see in 80% of tutorials, conference talks, and starter repos.
```
Phone Carrier
↓ (SIP/PSTN)
Telephony Provider (Twilio, Telnyx, Plivo)
↓ (WebSocket: Media Streams)
Your Server
↓ (audio chunks)
Speech-to-Text Service
↓ (transcript)
Large Language Model
↓ (response text)
Text-to-Speech Service
↓ (audio chunks)
Your Server
↓ (back through WebSocket)
Telephony Provider
↓ (PSTN)
Caller's Phone
```
Six network hops. Three external API calls. Each one buffered, each one with its own jitter and tail latency. This is what's documented in Twilio's Media Streams guide and what frameworks like LiveKit Agents are built around.
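Concretely, the "Your Server" box is a WebSocket handler that decodes Media Streams frames, pushes audio toward STT, and pipes synthesized audio back out. A stripped-down sketch, assuming a recent version of the `websockets` package; the STT/LLM/TTS wiring is stubbed out as a queue because the point here is the hops, not the pipeline:

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

stt_queue: asyncio.Queue = asyncio.Queue()  # stand-in for the server -> STT hop

async def handle_call(ws):
    """One Twilio Media Streams connection per phone call."""
    stream_sid = None
    async for message in ws:
        frame = json.loads(message)
        event = frame.get("event")
        if event == "start":
            stream_sid = frame["start"]["streamSid"]
        elif event == "media":
            # ~20ms of 8 kHz mu-law audio per frame, base64-encoded
            chunk = base64.b64decode(frame["media"]["payload"])
            await stt_queue.put((stream_sid, chunk))
        elif event == "stop":
            break

async def speak(ws, stream_sid: str, mulaw_audio: bytes):
    """Send synthesized audio back through Twilio toward the PSTN."""
    await ws.send(json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_audio).decode()},
    }))

async def main():
    async with websockets.serve(handle_call, "0.0.0.0", 8080):
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```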
It works. It's also slow.
Where the Milliseconds Actually Go
Let me decompose a single conversational turn so you can see where time disappears. These are realistic numbers from a US-based stack hitting US-East endpoints over a clean network:
| Stage | What's Happening | Realistic Time |
|---|---|---|
| Caller stops speaking | Acoustic silence begins | 0ms |
| VAD / endpointing detects end | STT decides the user is done | 200-600ms |
| Final transcript returned | STT flushes its buffer | 50-200ms |
| LLM time-to-first-token | First reasoning model token arrives | 400-1500ms |
| Token accumulation | Wait for enough text to start TTS | 100-300ms |
| TTS time-to-first-audio | First audio bytes arrive | 200-500ms |
| Audio buffer warmup | Codec converts and queues for transmission | 50-150ms |
| Network hop back to PSTN | Egress, codec, jitter buffer | 100-300ms |
Total: 1100ms on a fast day. 3550ms on a bad one.
That's why your demo "feels fine" on your laptop while your beta users complain that the agent feels robotic and slow on a real phone.
The killers, ranked:
1. Endpoint Detection
This is the silent budget killer. You can't start the LLM until you decide the caller stopped talking. Most STT services like Deepgram, AssemblyAI, and Speechmatics expose an endpointing parameter that defaults to roughly 500ms of silence. That's a 500ms tax on every single turn before any model has even started thinking.
If you tighten it, the agent interrupts callers mid-sentence. If you loosen it, conversations feel sluggish. Aggressive endpointing combined with a smart turn-detection model is the single biggest unlock most teams discover late.
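As one concrete example, Deepgram exposes endpointing as a query parameter on its streaming WebSocket URL. The values below are illustrative starting points, not recommendations; verify the parameter names against the current docs before relying on them:

```python
import os
from urllib.parse import urlencode

# Illustrative values only: tighter endpointing buys back latency but
# raises the odds of cutting callers off mid-sentence.
params = urlencode({
    "model": "nova-2",
    "encoding": "mulaw",        # what Twilio Media Streams delivers
    "sample_rate": 8000,
    "interim_results": "true",  # needed if you want speculative LLM starts
    "endpointing": 300,         # ms of trailing silence before a final transcript
    "utterance_end_ms": 1000,   # slower, higher-confidence "really done" signal
})

DEEPGRAM_URL = f"wss://api.deepgram.com/v1/listen?{params}"
HEADERS = {"Authorization": f"Token {os.environ.get('DEEPGRAM_API_KEY', '')}"}
```

Whatever provider you use, treat the value as a per-deployment tuning knob, not a constant.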
2. LLM Time-to-First-Token
Frontier models are slow to start. OpenAI's GPT-realtime model is built for this exact problem and still has noticeable warmup. Anthropic's Claude is excellent at reasoning but its time-to-first-token on Sonnet-class models can sit in the 600-1200ms range depending on prompt size. Smaller models are faster but make worse phone agents.
The tradeoff is real and there's no clever way around it. You either pay the latency for a smart model or you pay the conversation-quality tax for a fast one.
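Whichever model you pick, measure time-to-first-token with your own prompts instead of trusting published benchmarks. A sketch using OpenAI's streaming chat completions; swap in whichever provider you actually run, and treat the model name as an example:

```python
import time

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def time_to_first_token(prompt: str, model: str = "gpt-4o-mini") -> float:
    """Milliseconds from request start to the first streamed text token."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.monotonic() - start) * 1000
    return float("nan")

if __name__ == "__main__":
    samples = sorted(time_to_first_token("Reply with one short sentence.") for _ in range(5))
    print(f"median TTFT: {samples[len(samples) // 2]:.0f} ms")
```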
3. TTS Latency
Older TTS services synthesize whole utterances and only return audio when generation finishes. A 12-word reply might take 600ms to render before the first byte ships. Streaming TTS APIs like ElevenLabs streaming emit audio chunks as text arrives, but you still wait for "enough" tokens before the synthesizer can start producing useful prosody.
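The usual workaround is to chunk the LLM stream at natural boundaries and hand each chunk to the synthesizer the moment it exists. A provider-agnostic sketch; whatever streaming TTS client you use consumes the chunks this async generator yields:

```python
import re
from typing import AsyncIterator

SENTENCE_BOUNDARY = re.compile(r"[.!?,;:]\s")

async def tts_chunks(tokens: AsyncIterator[str], min_chars: int = 24) -> AsyncIterator[str]:
    """Group streamed LLM tokens into prosody-friendly chunks for streaming TTS."""
    buffer = ""
    async for token in tokens:
        buffer += token
        match = SENTENCE_BOUNDARY.search(buffer)
        # Flush at punctuation once there's enough text for natural prosody.
        if match and len(buffer) >= min_chars:
            yield buffer[:match.end()]
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer  # whatever is left when generation finishes
```

The first yielded chunk is what sets your time-to-first-audio, so the `min_chars` threshold is a latency knob in its own right.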
4. Network Hops
Each external call adds the round-trip time of wherever your servers happen to be relative to the provider. Co-locate everything in the same region. Ideally the same availability zone. A US-West server hitting a US-East TTS provider eats 60-80ms per call for absolutely no reason.
Why Pipelines Break in Production
Even after you tune every component, the orchestration logic between them is where production agents fall apart. The hard problems aren't well documented in any tutorial:
Barge-in handling. When the caller starts talking while the agent is mid-sentence, you have to instantly stop TTS playback, drain the audio buffer on the telephony side, cancel the in-flight LLM call, and start listening again. Get this wrong and your agent talks over callers, which is the single fastest way to make people hate it.
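The core of the handler is small once the pipeline is task-based. A sketch assuming a bidirectional Twilio Media Stream (where sending a `clear` message flushes Twilio's buffered outbound audio) and asyncio tasks for the in-flight LLM and TTS work; the task handles are whatever your own pipeline holds:

```python
import asyncio
import json
from typing import Optional

async def handle_barge_in(ws, stream_sid: str,
                          llm_task: Optional[asyncio.Task],
                          tts_task: Optional[asyncio.Task]):
    """Call this the instant VAD reports caller speech while the agent is talking."""
    # 1. Stop generating more audio.
    for task in (llm_task, tts_task):
        if task is not None and not task.done():
            task.cancel()
    # 2. Flush audio Twilio has already buffered, so playback stops mid-word
    #    instead of finishing the sentence over the caller.
    await ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))
    # 3. Back to listening. The STT stream never stopped, so there's nothing to restart.
```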
Function-calling round-trips. When the LLM needs to call a tool (lookup customer, check availability, transfer call), that's another network hop on top of the conversational loop. Now your turn latency is LLM-1 + tool + LLM-2 + TTS, which can easily blow past 3 seconds.
Streaming partial transcripts. You can speculatively start the LLM on partial transcripts to win latency, but you have to handle the case where the speculation was wrong and the user actually said something different than the partial. This is a state machine nobody enjoys writing.
Filler words. Real receptionists say "let me check that for you" while looking something up. Adding this to a voice agent requires a mini state machine that runs in parallel with your tool calls and pre-emits audio. Otherwise the caller hears 4 seconds of dead silence.
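One way to structure it: race the tool call against a silence budget, and only play the filler if the tool overruns it. A sketch with stubbed-in `speak` and tool helpers (both hypothetical):

```python
import asyncio

async def with_filler(tool_call, speak, filler_text="Let me check that for you.",
                      silence_budget=1.2):
    """Run a slow tool call; speak a filler phrase if it overruns the silence budget."""
    task = asyncio.ensure_future(tool_call)
    try:
        # shield() keeps the tool running even if the timeout fires.
        return await asyncio.wait_for(asyncio.shield(task), timeout=silence_budget)
    except asyncio.TimeoutError:
        await speak(filler_text)   # pre-emit audio while the tool keeps working
        return await task          # then deliver the real answer

async def _demo():
    async def lookup_customer():
        await asyncio.sleep(3)     # pretend the CRM is slow today
        return {"name": "Dana"}

    async def speak(text):
        print(f"[agent says] {text}")

    print(await with_filler(lookup_customer(), speak))

if __name__ == "__main__":
    asyncio.run(_demo())
```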
Reconnect logic. WebSocket connections to telephony providers drop. Your STT connection drops. Your TTS connection drops. Each of these has to gracefully reconnect mid-call without dropping audio, without restarting the conversation, and without confusing the LLM about what was already said.
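The shape of the fix is the same for every connection: a supervisor loop that reconnects with backoff and replays whatever configuration the far side needs before audio resumes. A generic sketch; `connect`, `resume`, and `consume` are hypothetical hooks for a specific provider:

```python
import asyncio
import random

async def supervised_stream(connect, resume, consume, max_backoff: float = 5.0):
    """Keep one provider WebSocket alive for the lifetime of a call."""
    backoff = 0.25
    while True:
        try:
            ws = await connect()      # open (or reopen) the provider socket
            await resume(ws)          # e.g. re-send STT config or TTS voice settings
            backoff = 0.25            # healthy again, reset the backoff
            await consume(ws)         # blocks until this connection drops
        except asyncio.CancelledError:
            raise                     # the call ended; stop supervising
        except Exception:
            await asyncio.sleep(backoff + random.random() * 0.1)  # jittered backoff
            backoff = min(backoff * 2, max_backoff)
```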
By the time you've handled all of this, you've written 3,000 to 5,000 lines of orchestration code that has nothing to do with your actual product.
Architectural Patterns That Actually Work
After watching teams ship and fail at this, these are the patterns that consistently produce sub-second agents:
Co-locate Everything
Pin your server, your STT provider, your LLM provider, and your TTS provider to the same region. If your telephony provider lets you choose where Media Streams terminate (Twilio supports US, Ireland, and Australia), use the closest one to the rest of your stack. This routinely shaves 100-200ms off every turn.
Stream Aggressively
Don't wait for full transcripts. Don't wait for full LLM responses. Don't wait for full TTS audio. The pipeline should be a continuous flow where each component starts processing whatever is available the moment it's available.
Speculative Generation
The moment partial transcripts look "complete enough", start the LLM. If the user keeps talking, cancel and restart. The cost of a wasted LLM call is much smaller than the latency saved when speculation is right (which it is most of the time).
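A minimal sketch of that cancel-and-restart loop, assuming your STT pushes `(kind, transcript)` pairs onto a queue where kind is `"partial"` or `"final"`, and `generate_reply` is a hypothetical coroutine wrapping your LLM call:

```python
import asyncio

def looks_complete(text: str) -> bool:
    # Crude heuristic stand-in; real systems use a turn-detection model here.
    return len(text.split()) >= 3 and text.rstrip().endswith((".", "?", "!"))

async def speculative_turn(transcripts: asyncio.Queue, generate_reply):
    """Start the LLM on promising partials; cancel and restart if the user keeps talking."""
    speculation = None
    speculated_on = ""
    while True:
        kind, text = await transcripts.get()
        if speculation is not None and text != speculated_on:
            speculation.cancel()   # no-op if already finished; the result is stale either way
            speculation = None
        if kind == "final":
            if speculation is None:
                speculation = asyncio.ensure_future(generate_reply(text))
            return await speculation
        if speculation is None and looks_complete(text):
            speculated_on = text
            speculation = asyncio.ensure_future(generate_reply(text))
```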
Pre-warmed Connections
Don't open WebSocket connections to your STT and TTS providers per call. Maintain a warm pool. The TLS handshake alone costs 50-100ms.
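A sketch of the warm pool, with a hypothetical `open_tts_socket`-style factory standing in for whatever your provider's connect-and-handshake call is:

```python
import asyncio

class WarmPool:
    """Keep N provider connections open so a new call never pays the handshake."""

    def __init__(self, factory, size: int = 4):
        self._factory = factory            # e.g. open_tts_socket (hypothetical)
        self._pool: asyncio.Queue = asyncio.Queue(maxsize=size)
        self._size = size

    async def start(self):
        for _ in range(self._size):
            await self._pool.put(await self._factory())

    async def acquire(self):
        conn = await self._pool.get()
        asyncio.ensure_future(self._refill())  # start warming a replacement now
        return conn

    async def _refill(self):
        await self._pool.put(await self._factory())
```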
Skip the Round-Trip on Predictable Replies
If a user just said "yes" to a confirmation, you don't need to round-trip the LLM. Your application logic knows what comes next. Bypass the model and emit the canned response directly. You'll find 10-20% of conversational turns fall into this category.
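The check can be as small as a lookup keyed on conversation state; the state names and clip filenames below are hypothetical:

```python
# (conversation_state, normalized caller reply) -> pre-rendered audio clip
CANNED_REPLIES = {
    ("confirm_booking", "yes"): "booking_confirmed.ulaw",
    ("confirm_booking", "yeah"): "booking_confirmed.ulaw",
    ("confirm_booking", "no"): "booking_cancelled.ulaw",
}

def canned_reply(state: str, transcript: str):
    """Return a pre-synthesized clip if this turn never needs the LLM, else None."""
    key = (state, transcript.strip().lower().rstrip(".!"))
    return CANNED_REPLIES.get(key)  # None means fall through to the model
```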
Use Smaller Models for Routing
A 1B-parameter model is fine for "is this user asking a clarifying question or starting a new request?" Save the frontier model for actual reasoning steps. This matters more for cost than latency, but it compounds.
This Is Exactly Why We Built the OnCallClerk SDK
We built voice agents the hard way for two years. Stitched together a telephony provider, an STT vendor, an LLM, and a TTS service. Wrote our own barge-in handler. Our own filler word logic. Our own reconnect state machine. Our own speculative streaming. Our own per-customer prompt caching.
It worked. It also took 18 months and three production incidents we never wanted to repeat.
The OnCallClerk SDK is what we wished we'd had on day one: a single API where you describe your agent's behavior, plug in your business logic via tool calls, and get a phone number that picks up at the latency of a fast human receptionist. The plumbing is invisible. The orchestration state machines are battle-tested. The endpoint detection is tuned per-conversation, not globally.
You write your agent's brain. We handle every millisecond of the pipeline.
If you're at the start of building this, read this article, save yourself 18 months, and skip to the API reference. If you're already deep in the stack and the latency is killing you, the SDK drops in alongside your existing telephony.
Keep Reading
- How to Reduce Latency in Voice AI Agents - Diagnostic guide for slow existing agents
- Why Voice AI Costs More Than Expected - Hidden cost breakdown
- Twilio AI Voice Agent Tutorial - Build vs buy analysis
- How to Make AI Voice Sound Human on Calls - Voice quality and prosody
