You Already Have a Slow Agent. Now What?
This guide assumes you've already built a voice AI agent and it's slow. Users say "it feels weird", QA flags it as "robotic", and your CEO has stopped doing demos. You need to find where the milliseconds are going and which fixes will actually move the number.
If you're building from scratch, read How to Build a Low Latency AI Phone Agent first. This article is a diagnostic playbook for an agent that already exists.
Step 1: Stop Measuring the Wrong Thing
The first mistake every team makes is measuring latency as the LLM round-trip time. The LLM is rarely the slowest piece, and even when it is, it's never the only piece.
Instrument these four numbers separately, per turn, in production:
- Endpoint detection latency - time from the start of acoustic silence to your STT declaring the turn complete
- LLM time-to-first-token - request sent to first byte of useful response
- TTS time-to-first-audio - first text token sent to first audio byte
- Network egress time - time from the first audio byte leaving your server to the first audio byte arriving at the telephony provider
The sum of these four is your real conversational latency. If you only have one of them logged, you're guessing.
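Here's a minimal sketch of what that per-turn record can look like, assuming a Python orchestrator where you can stamp `time.monotonic()` at each stage boundary (the field names are illustrative, not any standard):

```python
import time
from dataclasses import dataclass

@dataclass
class TurnLatency:
    """Monotonic timestamps (seconds) at each stage boundary of one turn."""
    user_stopped_speaking: float = 0.0   # acoustic silence begins
    stt_turn_complete: float = 0.0       # STT declares the turn over
    llm_first_token: float = 0.0         # first useful LLM byte
    tts_first_audio: float = 0.0         # first audio byte from TTS
    telephony_first_audio: float = 0.0   # first audio byte at the provider

    def stages_ms(self) -> dict[str, float]:
        ms = lambda a, b: (b - a) * 1000
        return {
            "endpointing": ms(self.user_stopped_speaking, self.stt_turn_complete),
            "llm_ttft": ms(self.stt_turn_complete, self.llm_first_token),
            "tts_ttfa": ms(self.llm_first_token, self.tts_first_audio),
            "network_egress": ms(self.tts_first_audio, self.telephony_first_audio),
        }

# Stamp each field with time.monotonic() as the boundary fires,
# then log turn.stages_ms() once per turn.
```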
The thresholds you should care about:
| Stage | Good | Acceptable | Broken |
|---|---|---|---|
| Endpointing | <300ms | <600ms | >800ms |
| LLM TTFT | <400ms | <900ms | >1500ms |
| TTS TTFA | <250ms | <500ms | >800ms |
| Network egress | <100ms | <250ms | >400ms |
If any one of those is in "Broken", fix that one before touching anything else. The fixes for each are different, and you can waste days optimizing the wrong stage.
Step 2: Find Out Which Stage Is the Problem
The fastest diagnostic is to log timestamps at every stage boundary in a single turn and dump them. Don't do this in aggregate yet. Look at three or four individual conversations and read the timestamps. You'll see the bottleneck immediately.
Common patterns:
- Endpointing dominates. Your STT is waiting too long after the user stops talking. Fix in Step 3.
- LLM dominates. Either the model is slow, your prompt is huge, or your tool-calling round-trips are stacking. Fix in Step 4.
- TTS dominates. Either your TTS isn't streaming, you're using a non-streaming voice, or you're waiting for the full LLM response before starting it. Fix in Step 5.
- Everything is okay individually but it still feels slow. You probably have buffering between stages. Fix in Step 6.
Step 3: Tighten Endpoint Detection
Endpoint detection (also called turn detection or VAD) is where the most latency hides. Default settings on every STT provider err toward "wait, the user might keep talking" because false interruption is way more annoying than dead air. For phone agents, this default is too conservative.
Your STT provider exposes one or more of these:
- End-of-speech threshold (silence duration before declaring the turn over)
- Endpointing model (some providers ship a learned model on top of VAD)
- Interim results (whether you get partials)
Tactics that work:
Use Interim Results to Pre-warm the LLM
Most teams wait for the final transcript before calling the LLM. You don't have to. The moment you have an interim transcript that looks grammatically complete (ends in terminal punctuation, reads as a finished sentence), fire a speculative LLM call. If the user keeps talking, cancel it. Most of the time, you save 300-500ms.
Deepgram documents this pattern under interim results. AssemblyAI and others have similar primitives.
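A sketch of the speculate-and-cancel pattern using asyncio task cancellation; `looks_complete` is a deliberately crude heuristic and `call_llm` stands in for your own async client:

```python
import asyncio

def looks_complete(text: str) -> bool:
    # Cheap heuristic: ends with terminal punctuation. Swap in your own.
    return text.rstrip().endswith((".", "?", "!"))

class SpeculativeLLM:
    def __init__(self, call_llm):
        self.call_llm = call_llm            # async fn: transcript -> response
        self.pending: asyncio.Task | None = None
        self.speculated_text = ""

    def on_interim(self, transcript: str):
        if self.pending and transcript != self.speculated_text:
            self.pending.cancel()           # user kept talking: abandon the guess
            self.pending = None
        if self.pending is None and looks_complete(transcript):
            self.speculated_text = transcript
            self.pending = asyncio.create_task(self.call_llm(transcript))

    async def on_final(self, transcript: str):
        task, self.pending = self.pending, None
        if task and transcript == self.speculated_text:
            return await task               # speculation paid off: TTFT saved
        if task:
            task.cancel()                   # guess was wrong; discard it
        return await self.call_llm(transcript)
```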
Drop Endpointing to 300ms with a Smarter Turn Model
If your STT supports a learned end-of-turn model (some do, like Deepgram's "smart endpointing"), enable it and drop your fallback silence threshold aggressively. The learned model catches genuine turn ends, the silence threshold catches the rest.
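For illustration, with Deepgram's live WebSocket these knobs are query parameters; verify the exact names against the current docs before relying on them:

```python
from urllib.parse import urlencode

# Hedged example: Deepgram live-transcription parameters as of this
# writing. Names and defaults change; check the docs before shipping.
params = urlencode({
    "model": "nova-2",
    "interim_results": "true",   # partials enable speculative LLM calls
    "endpointing": 300,          # fallback silence threshold, in ms
    "smart_format": "true",
})
DEEPGRAM_URL = f"wss://api.deepgram.com/v1/listen?{params}"
```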
Don't End-of-turn on Filler Words
If your transcript ends in "um", "uh", "let me think", don't fire the LLM. Wait. This requires post-processing the transcript before deciding the turn is done, but it eliminates the most annoying false-trigger cases.
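A minimal gate, assuming you post-process the transcript before declaring the turn done (the filler list here is a starting point; tune it to your callers):

```python
FILLERS = ("um", "uh", "er", "hmm", "let me think", "let me see", "hold on")

def turn_really_done(transcript: str) -> bool:
    """Refuse to end the turn if the user trailed off on a filler."""
    tail = transcript.lower().rstrip(" .,!?")
    return not tail.endswith(FILLERS)
```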
Step 4: Speed Up the LLM Path
Once endpointing is tight, the LLM is usually next.
Prompt Caching Is the Single Biggest Win
If you're sending a 4,000-token system prompt on every turn, you're paying full prefill cost every time. Both OpenAI and Anthropic offer prompt caching that discounts cached input tokens steeply (50% on OpenAI, 90% on Anthropic, at the time of writing) and, more importantly, drops time-to-first-token because the prefill is already computed. Set up caching properly and TTFT on a prompt that size can drop from around 800ms to 300ms.
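With Anthropic, caching is opt-in via `cache_control` on the stable prefix; OpenAI caches long, stable prefixes automatically. A sketch of the Anthropic version (the model name and minimum cacheable length change over time, so check the docs):

```python
import anthropic

client = anthropic.Anthropic()

BIG_SYSTEM_PROMPT = "..."   # your real, stable 4,000-token system prompt
turn_messages = [{"role": "user", "content": "latest caller transcript"}]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",          # pick your current model
    max_tokens=300,
    system=[{
        "type": "text",
        "text": BIG_SYSTEM_PROMPT,               # the stable prefix
        "cache_control": {"type": "ephemeral"},  # prefill computed once, reused
    }],
    messages=turn_messages,                      # only this part varies per turn
)
```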
Stop Sending the Full Conversation History
Most teams send the full conversation as messages every turn. After 10 turns, that's thousands of tokens of mostly irrelevant chatter. Summarize aggressively: keep the last 3 turns verbatim and summarize everything older.
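One way to do it, with `summarize` left as a placeholder (a call to a small, cheap model works, and so does a rule-based recap):

```python
def compact_history(messages: list[dict], summarize, keep_last: int = 6) -> list[dict]:
    """Keep the most recent messages verbatim; fold everything older into one line.

    keep_last=6 is roughly the last 3 user/assistant turns.
    """
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    recap = summarize(older)   # e.g. one call to a cheap model
    return [{"role": "system", "content": f"Earlier in this call: {recap}"}, *recent]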
Pick the Model for the Turn, Not the App
Not every conversational turn needs your top model. "Is the user confirming or denying?" can be answered by a 1B parameter classifier. Save the frontier model for the turns that actually need reasoning. Two-tier routing routinely cuts average TTFT in half.
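The shape of the router, with the classifier and both model calls left as placeholders for your own clients:

```python
async def answer_turn(transcript: str, history: list[dict],
                      classify, small_model, frontier_model) -> str:
    """Route easy turns to a fast model; reserve the big one for real reasoning."""
    label = await classify(transcript)   # tiny model, regex, or keyword match
    if label in ("confirm", "deny", "smalltalk"):
        return await small_model(transcript, history)
    return await frontier_model(transcript, history)
```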
Watch Your Tool Call Round-Trips
Every tool call adds: LLM-1 → tool call → tool execution → LLM-2 → response. That's two LLM calls per turn. If your tool execution itself is slow (database query, third-party API), that compounds. Cache anything that doesn't change per call. Pre-fetch likely tools at conversation start.
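A minimal TTL cache for tool results that are stable within a call (the wrapper and the 60-second TTL are our choices; adjust per tool, and note the args must be hashable):

```python
import time

_cache: dict[tuple, tuple[float, object]] = {}

async def cached_tool(fn, *args, ttl: float = 60.0):
    """Memoize slow tool calls that don't change within a conversation."""
    key = (fn.__name__, args)
    hit = _cache.get(key)
    if hit and time.monotonic() - hit[0] < ttl:
        return hit[1]
    result = await fn(*args)
    _cache[key] = (time.monotonic(), result)
    return result
```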
Use Filler Audio During Tool Calls
When you know a tool call will take more than 500ms, immediately play "let me check that for you" while the tool runs. Real receptionists do this. Voice agents that don't do it sound like they've hung.
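A sketch of the race: start the tool, and if it hasn't finished inside the threshold, start the filler clip (`play_audio` and the clip itself are placeholders for your audio path):

```python
import asyncio

async def run_tool_with_filler(tool_coro, play_audio, filler_clip: bytes,
                               threshold: float = 0.5):
    """Play a filler line only if the tool is still running after 500ms."""
    task = asyncio.create_task(tool_coro)
    done, _ = await asyncio.wait({task}, timeout=threshold)
    if not done:
        await play_audio(filler_clip)   # caller hears something, not dead air
    return await task
```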
Step 5: Make TTS Stream
If your TTS is taking 600ms+ to return audio, it's probably not streaming. Confirm:
Are You Sending Tokens or Whole Sentences?
A streaming TTS API like ElevenLabs' streaming endpoint accepts text as it arrives and emits audio chunks. If you're calling `/text-to-speech` instead of `/text-to-speech/stream`, you're not streaming.
Are You Waiting for the Full LLM Response?
Worst pattern: collect entire LLM response, send to TTS, wait for full audio. Best pattern: pipe LLM tokens directly into TTS as they arrive. The TTS starts producing audio while the LLM is still generating. End-to-end latency drops by hundreds of milliseconds on every turn.
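The shape of the pipe, assuming an OpenAI-style async streaming iterator on one side and a TTS client with streaming text input on the other (`tts_send` and `tts_flush` are hypothetical stand-ins for your provider's client):

```python
async def speak_llm_response(llm_stream, tts_send, tts_flush):
    """Forward LLM tokens to a streaming TTS the moment they arrive."""
    async for chunk in llm_stream:               # e.g. an OpenAI stream=True response
        token = chunk.choices[0].delta.content or ""
        if token:
            await tts_send(token)                # TTS starts synthesizing now
    await tts_flush()                            # signal that no more text is coming
```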
Are You Using a Slow Voice?
Some voice models are higher quality but slower. ElevenLabs' "Turbo" tier is much faster than their flagship at a slight quality cost. For conversational use, the speed wins.
Is the Audio Codec Conversion Killing You?
Telephony providers want telephony-rate audio: Twilio Media Streams, for example, expect 8kHz μ-law, the same format as classic PSTN. If your TTS produces 22kHz MP3, every chunk needs decoding and resampling before transmission. Find a TTS that emits in your transport format directly, or use a fast native resampler.
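If you do have to convert in-process, the stdlib `audioop` module handles both steps correctly; note it was deprecated in Python 3.11 and removed in 3.13, where the `audioop-lts` backport fills in:

```python
import audioop  # stdlib through 3.12; use the audioop-lts backport on 3.13+

def to_telephony_mulaw(pcm16: bytes, src_rate: int, state=None):
    """Resample 16-bit mono PCM to 8kHz, then G.711 mu-law encode it."""
    pcm_8k, state = audioop.ratecv(pcm16, 2, 1, src_rate, 8000, state)
    return audioop.lin2ulaw(pcm_8k, 2), state   # carry `state` across chunks
```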
Step 6: Eliminate Buffering Between Stages
Sometimes the individual stages are fine but the orchestration adds latency. Things to look for:
- Are you accumulating LLM tokens in a buffer until you have a "complete sentence" before sending to TTS? Don't. TTS streams; let it.
- Are you waiting for the full TTS audio before sending to telephony? Don't. Pipe chunks straight through.
- Are you running the pipeline synchronously instead of concurrently? Sequential awaits add each stage's full latency to the turn; overlapped stages only cost the slowest one.
- Is your event loop blocked by anything CPU-heavy (audio processing, JSON parsing of huge prompts)?
A good debugging trick: put a timestamp in every log line and read a single turn end to end. If you see Stage A finish at T+500, Stage B start at T+520, Stage B finish at T+800, Stage C start at T+810 - that's clean pipelining. If Stage B doesn't start until 100ms after Stage A finished, something between them is buffering.
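The fix for all four problems above is the same shape: connect stages with queues and forward every chunk the moment it exists. A stripped-down sketch:

```python
import asyncio

async def pipe(stage, inbox: asyncio.Queue, outbox: asyncio.Queue):
    """Run one stage; push each result the moment it's ready, never batch."""
    while (chunk := await inbox.get()) is not None:
        await outbox.put(await stage(chunk))
    await outbox.put(None)   # propagate end-of-stream downstream

# All stages run concurrently; chunks flow as soon as they exist:
# await asyncio.gather(
#     pipe(llm_stage, stt_q, llm_q),
#     pipe(tts_stage, llm_q, tts_q),
#     pipe(egress_stage, tts_q, done_q),
# )
```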
Step 7: The Network Layer Most People Ignore
You've optimized everything upstream. The agent still has noticeable lag. Check:
Are All Your Providers in the Same Region?
If your server is in us-east-1, your STT is in us-west-2, your LLM is in eastus, and your TTS is in eu-west-1, you're paying inter-region RTT on every single call. This is the dumbest 200ms you'll ever pay. Pin everything to one region.
Is Your Telephony Provider Terminating Media Locally?
Twilio's Media Streams terminate by default in US-East. If your callers are in Europe, you can configure Ireland (`IE1`) or Australia (`AU1`) regions. The savings are 100-150ms per leg.
TLS Handshakes Are Hidden Latency
Every new WebSocket connection includes a TLS handshake (1-2 round trips). If you're opening a fresh connection per call, that's 50-100ms of avoidable warmup. Maintain a warm pool.
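A minimal warm pool, assuming the `websockets` client library (any async WebSocket client with a `connect` call works the same way):

```python
import asyncio
import websockets

class WarmPool:
    """Keep a few TTS/STT sockets open so each call skips the TLS handshake."""
    def __init__(self, url: str, size: int = 2):
        self.url, self.size = url, size
        self.pool: asyncio.Queue = asyncio.Queue()

    async def fill(self):
        for _ in range(self.size):
            await self.pool.put(await websockets.connect(self.url))

    async def acquire(self):
        ws = await self.pool.get()
        asyncio.create_task(self._replace())   # replenish in the background
        return ws

    async def _replace(self):
        await self.pool.put(await websockets.connect(self.url))
```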
A Realistic Optimization Sequence
If you're staring at a slow agent right now, here's the order with the highest return-on-effort:
| Order | Fix | Time to Implement | Likely Win |
|---|---|---|---|
| 1 | Tighten endpointing to 300ms | 1 hour | 200-400ms |
| 2 | Enable prompt caching | 2 hours | 200-500ms |
| 3 | Stream LLM tokens into TTS | 4 hours | 200-400ms |
| 4 | Pin all providers to one region | 1 day | 100-200ms |
| 5 | Add filler audio during tools | 1 day | Large perceived win |
| 6 | Speculative LLM on partial transcripts | 2-3 days | 200-400ms |
| 7 | Two-tier model routing | 1-2 weeks | 100-300ms |
| 8 | Custom turn detection model | 2-4 weeks | 100-200ms |
Most teams stop at step 3 or 4 and ship something acceptable. The polish work in 5-8 is what separates good agents from great ones.
This Is Exactly Why We Built the OnCallClerk SDK
We optimized in this exact sequence on our own product, watched our latency drop from 2400ms to 700ms over six months, and decided we never wanted anyone else to have to do it. The OnCallClerk SDK ships with all of these fixes built in: tuned endpointing, prompt caching, streamed token-to-audio pipelining, regional co-location, and filler audio during tool calls. You bring your business logic. We bring the conversational latency.
If you're already deep in the stack, the SDK drops in alongside your telephony provider. If you're earlier, the API reference is the fastest way to skip the entire optimization journey.
Keep Reading
- How to Build a Low Latency AI Phone Agent - Architecture for new builds
- Why Voice AI Costs More Than Expected - Cost decomposition
- Cheapest Way to Run a Voice AI Agent - Minimum viable spend
- How to Make AI Voice Sound Human on Calls - Quality and prosody
