The Spreadsheet vs Reality Gap
Every team that builds a voice AI agent runs the same spreadsheet at the start. They look at LLM pricing, TTS pricing, STT pricing, and telephony pricing. They multiply, add a fudge factor, and conclude: "About 4 cents per minute. Easy."
Six months later they're paying 18 cents per minute, the gross margin model is broken, and nobody's quite sure where the money went.
This guide explains where it went. The actual cost of a production voice agent is roughly 3 to 5x the naive component math, and the reasons are non-obvious until you've shipped one. Understanding why is essential to either pricing the product correctly or picking an architecture that doesn't bleed cash.
The Naive Math
Here's the calculation everyone runs first. Numbers approximated from public pricing pages, current as of writing.
| Component | Public Price | Per Minute (rough) |
|---|---|---|
| STT (streaming) | $0.0043 / minute | $0.0043 |
| LLM (input + output, frontier model) | See below | $0.015 |
| TTS (streaming) | ~$0.02 / 1,000 chars | $0.018 |
| Telephony (inbound US) | $0.0085 / minute + number rental | $0.012 |
| Naive total | | $0.049 |
About 5 cents per minute. Price your service at $0.30/minute and you've got 84% gross margins. Easy money.
Except every line in that table is roughly half of what you'll actually pay.
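Here's that same spreadsheet as a few lines of Python, using the rough per-minute figures from the table above (all approximations, not quotes from any provider):

```python
# The naive spreadsheet math. Figures are the rough per-minute numbers
# from the table above; every one of them is an approximation.
naive = {
    "stt": 0.0043,       # streaming STT, $/min
    "llm": 0.015,        # frontier model, input + output, $/min
    "tts": 0.018,        # ~$0.02 / 1,000 chars at ~900 chars/min
    "telephony": 0.012,  # inbound US minutes + amortized number rental
}
naive_total = sum(naive.values())
print(f"naive cost per minute: ${naive_total:.4f}")

price_per_minute = 0.30
margin = 1 - naive_total / price_per_minute
print(f"implied gross margin at $0.30/min: {margin:.0%}")
```

Run it and you get about $0.049/minute and an 84% implied margin, which is exactly the trap.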
Where the Real Cost Comes From
1. Tokens Per Minute Is Way Higher Than You Think
The first miscalculation is LLM token usage. People estimate "the agent says about 30 words per turn, the user says about 30 words per turn, that's 60 words per minute, 80 tokens, easy."
Real conversation token counts:
| Component | Tokens Per Minute |
|---|---|
| User speech (transcript) | 100-150 |
| Agent speech (output) | 150-250 |
| System prompt (sent every turn) | 1500-4000 |
| Conversation history (grows) | 200-3000 |
| Tool definitions | 200-1000 |
| Tool call results | 100-500 |
Notice the "sent every turn" line. Most architectures send the full system prompt, the conversation history, and the tool schema on every single LLM call. A 4-minute call with a 2,500-token system prompt and 5 tool definitions can easily generate 30,000+ input tokens, not the 400 you estimated.
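The compounding effect is easy to see in a few lines. The numbers below are hypothetical mid-range picks from the table above, assuming one full exchange roughly every 20 seconds on a 4-minute call:

```python
# Hypothetical input-token growth for a 4-minute call: 12 turns, a
# 2,500-token system prompt and ~600 tokens of tool definitions resent
# on every turn, plus conversation history that grows as the call runs.
SYSTEM_PROMPT = 2500   # tokens, sent every turn
TOOL_DEFS = 600        # 5 tool definitions, sent every turn
TOKENS_PER_TURN = 40   # new user + agent tokens appended to history

total_input = 0
history = 0
for turn in range(12):
    total_input += SYSTEM_PROMPT + TOOL_DEFS + history
    history += TOKENS_PER_TURN  # history grows after each turn

print(total_input)  # ~40,000 input tokens for one 4-minute call
```

Twelve turns, and the static prefix alone accounts for over 37,000 of those tokens. That's the gap between the 400-token estimate and reality.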
OpenAI's pricing page and Anthropic's pricing both let you see the dramatic gap between cached and uncached input tokens. Without prompt caching, every turn pays full prefill cost. With caching, cached input is roughly 10% of uncached cost. Most teams ship without caching and discover it 6 months later.
Real LLM cost per minute (no caching, frontier model): $0.04 to $0.10, not $0.015.
2. Speech Synthesis Costs By Character, And Agents Talk More Than You Expect
TTS is priced by characters generated. The naive math assumes the agent says ~150 words per minute, roughly 750 characters. But voice agents don't have the brevity discipline of a chat UI. They:
- Repeat user inputs back for confirmation ("So that's 123 Main Street, right?")
- Add filler phrases ("let me check that for you")
- Restart sentences after barge-ins (you pay for the audio that got interrupted)
- Speak more slowly than humans, with more padding
Real character generation per active minute is closer to 1,200-1,800 characters, not 750. Real TTS cost per minute: $0.025 to $0.040, not $0.018.
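The math, assuming ~$0.02 per 1,000 characters (in the ballpark of several streaming TTS providers; substitute your own rate):

```python
# Sketch of TTS cost per active minute at an assumed ~$0.02 / 1,000
# characters, comparing the naive character estimate with production.
PRICE_PER_1K_CHARS = 0.02

naive_chars = 750   # ~150 spoken words/min
real_chars = 1500   # confirmations, fillers, restarted sentences

naive_cost = naive_chars / 1000 * PRICE_PER_1K_CHARS
real_cost = real_chars / 1000 * PRICE_PER_1K_CHARS
print(f"naive: ${naive_cost:.3f}/min, real: ${real_cost:.3f}/min")
```

That lands near the table's rounded figures: roughly double the naive estimate, before any of the other overheads.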
3. Telephony Has Hidden Line Items
The headline rate looks reasonable. The actual bill includes:
- Inbound voice minutes (the headline number)
- Number rental ($1-3/month per number, amortized over your call volume)
- Media Streams charges (sometimes a separate per-minute fee)
- A2P 10DLC registration fees (US SMS, if you also send SMS)
- Recording storage (if you record calls, which most do for QA)
- Call transfer minutes (if your agent transfers, you pay both legs)
- International origination (different rates for international callers)
For a phone-heavy product, true all-in telephony is closer to $0.020-$0.030 per minute.
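A hypothetical month makes the amortization concrete. Every number below is an assumption for illustration, not a quote from any carrier's pricing page:

```python
# All-in telephony per minute for a hypothetical 10,000-minute month,
# folding the hidden line items above into the headline rate.
minutes = 10_000
inbound = 0.0085 * minutes             # headline per-minute rate
numbers = 20 * 2.0                     # 20 numbers at ~$2/month rental
media_streams = 0.004 * minutes        # separate media per-minute fee
recording_storage = 0.005 * minutes    # recording + storage per minute
transfers = 0.0085 * 1_000             # 1,000 transferred minutes, 2nd leg

all_in = (inbound + numbers + media_streams
          + recording_storage + transfers) / minutes
print(f"all-in telephony: ${all_in:.4f}/min")
```

Even with modest assumptions, the all-in rate comes out around $0.022/minute, more than 2.5x the headline number.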
4. Failed Turns Cost Real Money
In production, ~10-15% of conversational turns fail in some way:
- LLM returns malformed JSON for a tool call → retry
- TTS generates audio but barge-in cancels it (you paid for the synthesis)
- STT transcript is wrong, agent responds incorrectly, user repeats themselves (extra turn)
- Network glitch, conversation continues but the failed turn was billed
- Speculative LLM call gets cancelled but already cost tokens
You pay for all of these. Budget 15% overhead on every component for retry / cancellation / error handling.
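Applied to the realistic per-minute component costs from this article, that overhead is a real line item, not a rounding error:

```python
# Failed-turn overhead as a multiplier on the usage-based components
# (STT + LLM + TTS + telephony), using this article's realistic figures.
base_per_minute = 0.005 + 0.06 + 0.032 + 0.025
FAILURE_OVERHEAD = 0.15  # ~10-15% of turns fail in some billable way

overhead = base_per_minute * FAILURE_OVERHEAD
print(f"failed-turn overhead: ${overhead:.4f}/min")
```

That's the ~$0.018/minute line in the summary table below: you're paying for a sixth "ghost component" made entirely of retries and cancelled work.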
5. Observability and Logging Aren't Free
You need:
- Call recording storage (audio is big - roughly 1MB per minute)
- Transcript storage and indexing (search, evals, fine-tuning data)
- Metrics and trace storage (if you instrumented properly)
- Long-term audit logs for compliance
For a 10,000-minute month, that's 10GB of audio recordings plus transcripts plus traces. Raw object storage is cheap; the transcript search indexing and the metrics/trace platform on top of it are not. Expect $200-500/month minimum, scaling roughly linearly.
6. Compliance and Account Management Costs
This is the hidden killer for serious products:
- Twilio business verification ($95 + monthly fees for some configurations)
- STIR/SHAKEN attestation for outbound calls
- A2P 10DLC campaign registration in the US
- GDPR/CCPA data handling (if you're EU/CA serving)
- HIPAA-grade infrastructure (if you touch healthcare)
- Voice cloning consent management (if you offer custom voices)
These are mostly fixed costs, but they have real per-minute amortization at low scale. A new product doing 5,000 minutes/month is paying $200-500 in compliance overhead, which adds 4-10 cents per minute on top of everything else.
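The amortization is a one-line division, but it's worth staring at because it dominates at low volume:

```python
# Fixed compliance overhead amortized across monthly minutes,
# using the example figures above: $200-500/month at 5,000 min/month.
monthly_minutes = 5_000
for fixed_cost in (200, 500):
    per_minute = fixed_cost / monthly_minutes
    print(f"${fixed_cost}/mo -> ${per_minute:.3f}/min")
```

At 5,000 minutes that's 4-10 cents per minute of pure fixed-cost amortization. At 50,000 minutes it fades to under a cent, which is why this line item punishes early-stage products the hardest.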
The Real Per-Minute Cost
Adding up the realistic numbers:
| Component | Naive | Real |
|---|---|---|
| STT | $0.0043 | $0.005 |
| LLM (frontier, no caching) | $0.015 | $0.06 |
| TTS | $0.018 | $0.032 |
| Telephony all-in | $0.012 | $0.025 |
| Failed-turn overhead (15%) | $0.000 | $0.018 |
| Observability/logging | $0.000 | $0.008 |
| Compliance amortization | $0.000 | $0.015 |
| Real total | $0.049 | $0.163 |
About 16 cents per minute. Roughly 3.3x the spreadsheet number.
This is why so many "voice AI" startups quietly raise prices, drop frontier models for cheaper ones, or pivot to higher-margin niches. The original unit economics didn't work.
The Typical "Optimization" Trap
Once teams discover the real cost, the first instinct is to swap to cheaper components:
- Drop to a smaller LLM. Quality tanks, conversion drops, churn goes up. Net loss.
- Drop to a cheaper TTS. Voice sounds robotic, callers hang up early, missed leads. Net loss.
- Drop endpointing aggressively. Agent interrupts callers, complaints. Net loss.
- Skip recording. Can't debug, can't QA, can't fine-tune. Slower iteration.
The real optimizations are architectural, not component-level:
Prompt Caching Is the Biggest Lever
If your system prompt is 3,000 tokens and you're not caching it, you're paying full price 60+ times per call. Enable caching and that drops by ~90% on the cached portion. On a typical agent, this saves 4-6 cents per minute, more than any other single change.
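The savings are easy to estimate. The figures below are hypothetical mid-range assumptions (prefix size, turn count, token price), and the 10% cached-read rate is true of at least one major provider; others discount less, so check your provider's pricing page:

```python
# Estimated savings from caching the static prefix (system prompt +
# tool definitions) on every turn of a call. All figures hypothetical.
PREFIX_TOKENS = 3_000       # static prefix resent on every turn
TURNS = 30                  # LLM calls over a 4-minute call
PRICE_PER_M_TOKENS = 2.50   # uncached input, $/million tokens
CACHE_DISCOUNT = 0.10       # cached reads at ~10% of base price

uncached = TURNS * PREFIX_TOKENS / 1e6 * PRICE_PER_M_TOKENS
cached = uncached * CACHE_DISCOUNT
saved_per_minute = (uncached - cached) / 4  # 4-minute call
print(f"saved: ${saved_per_minute:.4f}/min")
```

Under these assumptions the cached prefix alone saves about 5 cents per minute, which is why caching beats every component swap.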
Tier Your Models
Not every turn needs your frontier model. Use a small model for routing, classification, and simple confirmations. Save the big model for the 30% of turns that need real reasoning.
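A minimal sketch of the routing decision, with placeholder model names and a deliberately crude heuristic (in production you'd use a small classifier model rather than string matching):

```python
# Two-tier routing sketch. Model names are placeholders, and the
# heuristic is illustrative: trivial confirmations go to the cheap
# tier, tool calls and free-form reasoning go to the frontier model.
SMALL_MODEL = "small-fast-model"  # hypothetical cheap tier
LARGE_MODEL = "frontier-model"    # hypothetical expensive tier

SIMPLE_INTENTS = {"yes", "no", "okay", "thanks", "goodbye"}

def pick_model(user_turn: str, needs_tool_call: bool) -> str:
    text = user_turn.strip().lower()
    if needs_tool_call:
        return LARGE_MODEL
    if text in SIMPLE_INTENTS or len(text.split()) <= 3:
        return SMALL_MODEL
    return LARGE_MODEL

print(pick_model("yes", needs_tool_call=False))
print(pick_model("I need to move my appointment to Tuesday",
                 needs_tool_call=True))
```

The design point is that the router itself must be near-free; if you need a frontier model to decide which model to use, you've gained nothing.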
Summarize Conversation History
Once you're past 5-6 turns, summarize. Don't send the full transcript every turn. This single change cuts input tokens by 50-70% on long calls.
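One way to structure this is a sliding window: the last few turns stay verbatim, everything older collapses into a summary. The `summarize` function below is a placeholder for a cheap-model call:

```python
# History windowing sketch: keep the most recent turns in full and
# replace everything older with a running summary. In production,
# `summarize` would be a background call to a small, cheap model.
from typing import List, Tuple

KEEP_VERBATIM = 4  # most recent turns sent in full

def summarize(turns: List[Tuple[str, str]]) -> str:
    # Placeholder: ask a small model for a 1-2 sentence summary
    # of these (role, text) turns.
    return f"[summary of {len(turns)} earlier turns]"

def build_history(turns: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    if len(turns) <= KEEP_VERBATIM:
        return turns
    older, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    return [("system", summarize(older))] + recent

turns = [("user", f"turn {i}") for i in range(10)]
print(len(build_history(turns)))  # one summary entry + 4 verbatim turns
```

Summarizing in the background (between turns, not on the critical path) means the latency cost is zero even though the token savings are large.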
Pre-fetch Likely Tools
If a caller is in a booking flow, pre-warm the calendar tool. The tool latency disappears, you don't pay for waiting LLM tokens.
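A sketch of the idea with a thread pool, where `fetch_calendar` is a hypothetical stand-in for your real tool:

```python
# Tool pre-fetching sketch: once the caller enters a booking flow,
# kick off the calendar lookup in the background so the result is
# ready by the time the LLM emits the tool call. `fetch_calendar`
# is a hypothetical placeholder for a real, slow API.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_calendar(day: str) -> list:
    time.sleep(0.05)  # stand-in for a slow external API call
    return [f"{day} 10:00", f"{day} 14:30"]

executor = ThreadPoolExecutor(max_workers=2)

# Caller says "I'd like to book something tomorrow" -> start the
# fetch immediately, before the LLM decides to call the tool.
pending = executor.submit(fetch_calendar, "tomorrow")

# ... the LLM turn runs here; by the time it asks for the tool,
# the result is usually already sitting in `pending`.
slots = pending.result()
print(slots)
```

If the LLM never calls the tool, you've wasted one cheap API call; if it does, you've erased the tool's latency from the turn entirely. That trade is usually worth it.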
Cache Tool Results
Most calls hit the same data: business hours, services list, FAQ answers. Cache aggressively. The tool's compute cost AND the LLM's "interpret tool result" tokens both go away.
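A minimal TTL cache is enough for most of this, since the data changes on the scale of hours, not seconds:

```python
# TTL cache sketch for idempotent tool results (business hours,
# services list, FAQ answers). A hit skips both the tool's compute
# and the LLM tokens spent re-interpreting an identical result.
import time

class ToolCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]  # cache hit: the tool never runs
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value

calls = 0
def get_business_hours():
    global calls
    calls += 1
    return "Mon-Fri 9-5"

cache = ToolCache(ttl_seconds=300)
cache.get_or_fetch("hours", get_business_hours)
cache.get_or_fetch("hours", get_business_hours)  # served from cache
print(calls)  # the underlying tool ran only once
```

Scope the cache per business, not per call, and the first caller of the day pays for everyone.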
This Is Exactly Why We Built the OnCallClerk SDK
When we shipped our first voice agent, we hit the same 16-cents-per-minute wall and spent six months pulling it down to a sustainable number. Every fix in this article is a fix we ran ourselves: caching strategies, model tiering, tool result caching, smart endpointing that doesn't waste tokens on speculation we can't use.
The OnCallClerk SDK bakes all of this in. Per-customer prompt caching is automatic. Tool results are deduplicated. Conversation history is summarized in the background. Two-tier routing happens by default for the turns that don't need a frontier model. We charge you a flat per-minute rate so the cost is predictable, and we eat the variance.
If you've already shipped and you're seeing the bill, the API reference shows what a clean architecture looks like. If you're earlier, the savings calculator compares the real all-in cost of building vs using a managed agent.
Keep Reading
- How to Build a Low Latency AI Phone Agent - Architecture overview
- Cheapest Way to Run a Voice AI Agent - Minimum viable spend
- Twilio AI Voice Agent Tutorial - Build vs buy honest analysis
- How to Reduce Latency in Voice AI Agents - Performance debugging
