The Spreadsheet vs Reality Gap
Every team that builds a voice AI agent runs the same spreadsheet at the start. They look at LLM pricing, TTS pricing, STT pricing, and telephony pricing. They multiply, add a fudge factor, and conclude: "About 4 cents per minute. Easy."
Six months later they're paying 18 cents per minute, the gross margin model is broken, and nobody's quite sure where the money went.
This guide explains where it went. The actual cost of a production voice agent is roughly 3 to 5x the naive component math, and the reasons are non-obvious until you've shipped one. Understanding why is essential to either pricing the product correctly or picking an architecture that doesn't bleed cash.
The Naive Math
Here's the calculation everyone runs first. Numbers approximated from public pricing pages, current as of writing.
| Component | Public Price | Per Minute (rough) |
|---|---|---|
| STT (streaming) | $0.0043 / minute | $0.0043 |
| LLM (input + output, frontier model) | See below | $0.015 |
| TTS (streaming) | ~$0.02 / 1,000 chars | $0.018 |
| Telephony (inbound US) | $0.0085 / minute + number rental | $0.012 |
| Naive total | | $0.049 |
About 5 cents per minute. Price your service at $0.30/minute and you've got 84% gross margins. Easy money.
Except every line in that table is roughly half of what you'll actually pay.
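Here's that same spreadsheet as a few lines of Python, using the rough per-minute figures from the table above (all approximations, not quotes from any provider):

```python
# The naive spreadsheet math. Figures are the rough per-minute numbers
# from the table above; every one of them is an approximation.
naive = {
    "stt": 0.0043,       # streaming STT, $/min
    "llm": 0.015,        # frontier model, input + output, $/min
    "tts": 0.018,        # ~$0.02 / 1,000 chars at ~900 chars/min
    "telephony": 0.012,  # inbound US minutes + amortized number rental
}
naive_total = sum(naive.values())
print(f"naive cost per minute: ${naive_total:.4f}")

price_per_minute = 0.30
margin = 1 - naive_total / price_per_minute
print(f"implied gross margin at $0.30/min: {margin:.0%}")
```

Run it and you get about $0.049/minute and an 84% implied margin, which is exactly the trap.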
Where the Real Cost Comes From
1. Tokens Per Minute Is Way Higher Than You Think
The first miscalculation is LLM token usage. People estimate "the agent says about 30 words per turn, the user says about 30 words per turn, that's 60 words per minute, 80 tokens, easy."
Real conversation token counts:
| Component | Tokens Per Minute |
|---|---|
| User speech (transcript) | 100-150 |
| Agent speech (output) | 150-250 |
| System prompt (sent every turn) | 1500-4000 |
| Conversation history (grows) | 200-3000 |
| Tool definitions | 200-1000 |
| Tool call results | 100-500 |
Notice the "sent every turn" line. Most architectures send the full system prompt, the conversation history, and the tool schema on every single LLM call. A 4-minute call with a 2,500-token system prompt and 5 tool definitions can easily generate 30,000+ input tokens, not the 400 you estimated.
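The compounding effect is easy to see in a few lines. The numbers below are hypothetical mid-range picks from the table above, assuming one full exchange roughly every 20 seconds on a 4-minute call:

```python
# Hypothetical input-token growth for a 4-minute call: 12 turns, a
# 2,500-token system prompt and ~600 tokens of tool definitions resent
# on every turn, plus conversation history that grows as the call runs.
SYSTEM_PROMPT = 2500   # tokens, sent every turn
TOOL_DEFS = 600        # 5 tool definitions, sent every turn
TOKENS_PER_TURN = 40   # new user + agent tokens appended to history

total_input = 0
history = 0
for turn in range(12):
    total_input += SYSTEM_PROMPT + TOOL_DEFS + history
    history += TOKENS_PER_TURN  # history grows after each turn

print(total_input)  # ~40,000 input tokens for one 4-minute call
```

Twelve turns, and the static prefix alone accounts for over 37,000 of those tokens. That's the gap between the 400-token estimate and reality.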
OpenAI's pricing page and Anthropic's pricing both let you see the dramatic gap between cached and uncached input tokens. Without prompt caching, every turn pays full prefill cost. With caching, cached input is roughly 10% of uncached cost. Most teams ship without caching and discover it 6 months later.
Real LLM cost per minute (no caching, frontier model): $0.04 to $0.10, not $0.015.
2. Speech Synthesis Costs By Character, And Agents Talk More Than You Expect
TTS is priced by characters generated. The naive math assumes the agent says ~150 words per minute, roughly 750 characters. But voice agents don't have the brevity discipline of a chat UI. They:
- Repeat user inputs back for confirmation ("So that's 123 Main Street, right?")
- Add filler phrases ("let me check that for you")
- Restart sentences after barge-ins (you pay for the audio that got interrupted)
- Speak more slowly than humans, with more padding
Real character generation per active minute is closer to 1,200-1,800 characters, not 750. Real TTS cost per minute: $0.025 to $0.040, not $0.018.
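The math, assuming ~$0.02 per 1,000 characters (in the ballpark of several streaming TTS providers; substitute your own rate):

```python
# Sketch of TTS cost per active minute at an assumed ~$0.02 / 1,000
# characters, comparing the naive character estimate with production.
PRICE_PER_1K_CHARS = 0.02

naive_chars = 750   # ~150 spoken words/min
real_chars = 1500   # confirmations, fillers, restarted sentences

naive_cost = naive_chars / 1000 * PRICE_PER_1K_CHARS
real_cost = real_chars / 1000 * PRICE_PER_1K_CHARS
print(f"naive: ${naive_cost:.3f}/min, real: ${real_cost:.3f}/min")
```

That lands near the table's rounded figures: roughly double the naive estimate, before any of the other overheads.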
3. Telephony Has Hidden Line Items
The headline rate looks reasonable. The actual bill includes:
- Inbound voice minutes (the headline number)
- Number rental ($1-3/month per number, amortized over your call volume)
- Media Streams charges (sometimes a separate per-minute fee)
- A2P 10DLC registration fees (US SMS, if you also send SMS)
- Recording storage (if you record calls, which most do for QA)
- Call transfer minutes (if your agent transfers, you pay both legs)
- International origination (different rates for international callers)
For a phone-heavy product, true all-in telephony is closer to $0.020-$0.030 per minute.
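A hypothetical month makes the amortization concrete. Every number below is an assumption for illustration, not a quote from any carrier's pricing page:

```python
# All-in telephony per minute for a hypothetical 10,000-minute month,
# folding the hidden line items above into the headline rate.
minutes = 10_000
inbound = 0.0085 * minutes             # headline per-minute rate
numbers = 20 * 2.0                     # 20 numbers at ~$2/month rental
media_streams = 0.004 * minutes        # separate media per-minute fee
recording_storage = 0.005 * minutes    # recording + storage per minute
transfers = 0.0085 * 1_000             # 1,000 transferred minutes, 2nd leg

all_in = (inbound + numbers + media_streams
          + recording_storage + transfers) / minutes
print(f"all-in telephony: ${all_in:.4f}/min")
```

Even with modest assumptions, the all-in rate comes out around $0.022/minute, more than 2.5x the headline number.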
4. Failed Turns Cost Real Money
In production, ~10-15% of conversational turns fail in some way:
- LLM returns malformed JSON for a tool call → retry
- TTS generates audio but barge-in cancels it (you paid for the synthesis)
- STT transcript is wrong, agent responds incorrectly, user repeats themselves (extra turn)
- Network glitch, conversation continues but the failed turn was billed
- Speculative LLM call gets cancelled but already cost tokens
You pay for all of these. Budget 15% overhead on every component for retry / cancellation / error handling.
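Applied to the realistic per-minute component costs from this article, that overhead is a real line item, not a rounding error:

```python
# Failed-turn overhead as a multiplier on the usage-based components
# (STT + LLM + TTS + telephony), using this article's realistic figures.
base_per_minute = 0.005 + 0.06 + 0.032 + 0.025
FAILURE_OVERHEAD = 0.15  # ~10-15% of turns fail in some billable way

overhead = base_per_minute * FAILURE_OVERHEAD
print(f"failed-turn overhead: ${overhead:.4f}/min")
```

That's the ~$0.018/minute line in the summary table below: you're paying for a sixth "ghost component" made entirely of retries and cancelled work.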
5. Observability and Logging Aren't Free
You need:
- Call recording storage (audio is big - roughly 1MB per minute)
- Transcript storage and indexing (search, evals, fine-tuning data)
- Metrics and trace storage (if you instrumented properly)
- Long-term audit logs for compliance
For a 10,000-minute month, that's 10GB of audio recordings plus transcripts plus traces. Raw object storage is cheap; the transcript search indexing and the metrics/trace platform on top of it are not. Expect $200-500/month minimum, scaling roughly linearly.
6. Compliance and Account Management Costs
This is the hidden killer for serious products:
- Twilio business verification ($95 + monthly fees for some configurations)
- STIR/SHAKEN attestation for outbound calls
- A2P 10DLC campaign registration in the US
- GDPR/CCPA data handling (if you're EU/CA serving)
- HIPAA-grade infrastructure (if you touch healthcare)
- Voice cloning consent management (if you offer custom voices)
These are mostly fixed costs, but they have real per-minute amortization at low scale. A new product doing 5,000 minutes/month is paying $200-500 in compliance overhead, which adds 4-10 cents per minute on top of everything else.
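The amortization is a one-line division, but it's worth staring at because it dominates at low volume:

```python
# Fixed compliance overhead amortized across monthly minutes,
# using the example figures above: $200-500/month at 5,000 min/month.
monthly_minutes = 5_000
for fixed_cost in (200, 500):
    per_minute = fixed_cost / monthly_minutes
    print(f"${fixed_cost}/mo -> ${per_minute:.3f}/min")
```

At 5,000 minutes that's 4-10 cents per minute of pure fixed-cost amortization. At 50,000 minutes it fades to under a cent, which is why this line item punishes early-stage products the hardest.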
The Real Per-Minute Cost
Adding up the realistic numbers:
| Component | Naive | Real |
|---|---|---|
| STT | $0.0043 | $0.005 |
| LLM (frontier, no caching) | $0.015 | $0.06 |
| TTS | $0.018 | $0.032 |
| Telephony all-in | $0.012 | $0.025 |
| Failed-turn overhead (15%) | $0.000 | $0.018 |
| Observability/logging | $0.000 | $0.008 |
| Compliance amortization | $0.000 | $0.015 |
| Real total | $0.049 | $0.163 |
About 16 cents per minute. Roughly 3.3x the spreadsheet number.
This is why so many "voice AI" startups quietly raise prices, drop frontier models for cheaper ones, or pivot to higher-margin niches. The original unit economics didn't work.
The Typical "Optimization" Trap
Once teams discover the real cost, the first instinct is to swap to cheaper components:
- Drop to a smaller LLM. Quality tanks, conversion drops, churn goes up. Net loss.
- Drop to a cheaper TTS. Voice sounds robotic, callers hang up early, missed leads. Net loss.
- Drop endpointing aggressively. Agent interrupts callers, complaints. Net loss.
- Skip recording. Can't debug, can't QA, can't fine-tune. Slower iteration.
The real optimizations are architectural, not component-level:
Prompt Caching Is the Biggest Lever
If your system prompt is 3,000 tokens and you're not caching it, you're paying full price 60+ times per call. Enable caching and that drops by ~90% on the cached portion. On a typical agent, this saves 4-6 cents per minute, more than any other single change.
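The savings are easy to estimate. The figures below are hypothetical mid-range assumptions (prefix size, turn count, token price), and the 10% cached-read rate is true of at least one major provider; others discount less, so check your provider's pricing page:

```python
# Estimated savings from caching the static prefix (system prompt +
# tool definitions) on every turn of a call. All figures hypothetical.
PREFIX_TOKENS = 3_000       # static prefix resent on every turn
TURNS = 30                  # LLM calls over a 4-minute call
PRICE_PER_M_TOKENS = 2.50   # uncached input, $/million tokens
CACHE_DISCOUNT = 0.10       # cached reads at ~10% of base price

uncached = TURNS * PREFIX_TOKENS / 1e6 * PRICE_PER_M_TOKENS
cached = uncached * CACHE_DISCOUNT
saved_per_minute = (uncached - cached) / 4  # 4-minute call
print(f"saved: ${saved_per_minute:.4f}/min")
```

Under these assumptions the cached prefix alone saves about 5 cents per minute, which is why caching beats every component swap.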
Tier Your Models
Not every turn needs your frontier model. Use a small model for routing, classification, and simple confirmations. Save the big model for the 30% of turns that need real reasoning.
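A minimal sketch of the routing decision, with placeholder model names and a deliberately crude heuristic (in production you'd use a small classifier model rather than string matching):

```python
# Two-tier routing sketch. Model names are placeholders, and the
# heuristic is illustrative: trivial confirmations go to the cheap
# tier, tool calls and free-form reasoning go to the frontier model.
SMALL_MODEL = "small-fast-model"  # hypothetical cheap tier
LARGE_MODEL = "frontier-model"    # hypothetical expensive tier

SIMPLE_INTENTS = {"yes", "no", "okay", "thanks", "goodbye"}

def pick_model(user_turn: str, needs_tool_call: bool) -> str:
    text = user_turn.strip().lower()
    if needs_tool_call:
        return LARGE_MODEL
    if text in SIMPLE_INTENTS or len(text.split()) <= 3:
        return SMALL_MODEL
    return LARGE_MODEL

print(pick_model("yes", needs_tool_call=False))
print(pick_model("I need to move my appointment to Tuesday",
                 needs_tool_call=True))
```

The design point is that the router itself must be near-free; if you need a frontier model to decide which model to use, you've gained nothing.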
Summarize Conversation History
Once you're past 5-6 turns, summarize. Don't send the full transcript every turn. This single change cuts input tokens by 50-70% on long calls.
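One way to structure this is a sliding window: the last few turns stay verbatim, everything older collapses into a summary. The `summarize` function below is a placeholder for a cheap-model call:

```python
# History windowing sketch: keep the most recent turns in full and
# replace everything older with a running summary. In production,
# `summarize` would be a background call to a small, cheap model.
from typing import List, Tuple

KEEP_VERBATIM = 4  # most recent turns sent in full

def summarize(turns: List[Tuple[str, str]]) -> str:
    # Placeholder: ask a small model for a 1-2 sentence summary
    # of these (role, text) turns.
    return f"[summary of {len(turns)} earlier turns]"

def build_history(turns: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    if len(turns) <= KEEP_VERBATIM:
        return turns
    older, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    return [("system", summarize(older))] + recent

turns = [("user", f"turn {i}") for i in range(10)]
print(len(build_history(turns)))  # one summary entry + 4 verbatim turns
```

Summarizing in the background (between turns, not on the critical path) means the latency cost is zero even though the token savings are large.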
Pre-fetch Likely Tools
If a caller is in a booking flow, pre-warm the calendar tool. The tool latency disappears, you don't pay for waiting LLM tokens.
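A sketch of the idea with a thread pool, where `fetch_calendar` is a hypothetical stand-in for your real tool:

```python
# Tool pre-fetching sketch: once the caller enters a booking flow,
# kick off the calendar lookup in the background so the result is
# ready by the time the LLM emits the tool call. `fetch_calendar`
# is a hypothetical placeholder for a real, slow API.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_calendar(day: str) -> list:
    time.sleep(0.05)  # stand-in for a slow external API call
    return [f"{day} 10:00", f"{day} 14:30"]

executor = ThreadPoolExecutor(max_workers=2)

# Caller says "I'd like to book something tomorrow" -> start the
# fetch immediately, before the LLM decides to call the tool.
pending = executor.submit(fetch_calendar, "tomorrow")

# ... the LLM turn runs here; by the time it asks for the tool,
# the result is usually already sitting in `pending`.
slots = pending.result()
print(slots)
```

If the LLM never calls the tool, you've wasted one cheap API call; if it does, you've erased the tool's latency from the turn entirely. That trade is usually worth it.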
Cache Tool Results
Most calls hit the same data: business hours, services list, FAQ answers. Cache aggressively. The tool's compute cost AND the LLM's "interpret tool result" tokens both go away.
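A minimal TTL cache is enough for most of this, since the data changes on the scale of hours, not seconds:

```python
# TTL cache sketch for idempotent tool results (business hours,
# services list, FAQ answers). A hit skips both the tool's compute
# and the LLM tokens spent re-interpreting an identical result.
import time

class ToolCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]  # cache hit: the tool never runs
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value

calls = 0
def get_business_hours():
    global calls
    calls += 1
    return "Mon-Fri 9-5"

cache = ToolCache(ttl_seconds=300)
cache.get_or_fetch("hours", get_business_hours)
cache.get_or_fetch("hours", get_business_hours)  # served from cache
print(calls)  # the underlying tool ran only once
```

Scope the cache per business, not per call, and the first caller of the day pays for everyone.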
This Is Exactly Why We Built the OnCallClerk SDK
When we shipped our first voice agent, we hit the same 16-cents-per-minute wall and spent six months pulling it down to a sustainable number. Every fix in this article is a fix we ran ourselves: caching strategies, model tiering, tool result caching, smart endpointing that doesn't waste tokens on speculation we can't use.
The OnCallClerk SDK bakes all of this in. Per-customer prompt caching is automatic. Tool results are deduplicated. Conversation history is summarized in the background. Two-tier routing happens by default for the turns that don't need a frontier model. We charge you a flat per-minute rate so the cost is predictable, and we eat the variance.
If you've already shipped and you're seeing the bill, the API reference shows what a clean architecture looks like. If you're earlier, the savings calculator compares the real all-in cost of building vs using a managed agent.
Keep Reading
- How to Build a Low Latency AI Phone Agent - Architecture overview
- Cheapest Way to Run a Voice AI Agent - Minimum viable spend
- Twilio AI Voice Agent Tutorial - Build vs buy honest analysis
- How to Reduce Latency in Voice AI Agents - Performance debugging
