What "Cheapest" Actually Means
When developers ask "what's the cheapest way to run a voice AI agent?", they usually mean one of three different things:
- Cheapest to build (smallest engineering investment up front)
- Cheapest to run (lowest per-minute marginal cost in production)
- Cheapest to operate (lowest total cost including humans-in-the-loop, debugging, ongoing iteration)
These three answers are very different, and choosing the wrong one is how teams burn six months building something they can't afford to operate.
This guide walks through what minimum-viable looks like for each, and where cutting cost will quietly break your product.
The Bare-Bones Architecture
If you stripped a voice AI agent down to the absolute minimum components that can hold a real conversation on a phone line, this is what you'd have:
```
Caller → Telephony Provider → Your Server
                  ↓
STT (streaming) → Small LLM → TTS (streaming)
                  ↓
Telephony Provider → Caller
```
That's it. No retrieval. No tools. No memory. No analytics. No barge-in. The thing answers the phone, transcribes the caller, hands the text to a model, speaks the reply, and loops.
You can build this in a weekend. You can run it for under 4 cents per minute. It will also be useless for almost any real product, because:
- It can't look anything up (no tools, no database access)
- It can't transfer to a human
- It can't remember anything across calls
- It can't be interrupted (talks over callers)
- It has no failure handling (drops calls when anything glitches)
But it's the floor. Everything past this point is paying for capability or quality.
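The bare loop above can be sketched in a few lines of Python. Every stage function here is a placeholder standing in for whichever STT, LLM, and TTS providers you wire in, not a real SDK call:

```python
# Minimal voice agent loop: transcribe, respond, speak, repeat.
# All three stage functions are hypothetical placeholders for real
# streaming STT / LLM / TTS provider calls.

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder for a streaming STT request."""
    return "caller said something"

def generate_reply(history: list[dict]) -> str:
    """Placeholder for a small-LLM completion request."""
    return "agent reply"

def synthesize(text: str) -> bytes:
    """Placeholder for a streaming TTS request."""
    return text.encode()

def handle_call(audio_chunks):
    """One phone call: consume caller audio, yield agent audio."""
    history = []
    for chunk in audio_chunks:
        user_text = transcribe(chunk)
        history.append({"role": "user", "content": user_text})
        reply = generate_reply(history)
        history.append({"role": "assistant", "content": reply})
        yield synthesize(reply)  # streamed back through telephony
```

Note what's missing: no interruption handling, no tools, no error recovery. That's the point of the list above.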
Cheapest to Build
If your goal is "ship something to test the idea, don't optimize for production":
Use a Managed Telephony Voice API
Twilio Voice with Media Streams or similar gets you a phone number and audio streaming with about 100 lines of code. You don't need to know SIP. You don't need a Kamailio server. You don't need PSTN expertise.
Use the OpenAI Realtime API or Equivalent
The OpenAI Realtime API, or frameworks like LiveKit Agents, collapse the STT + LLM + TTS stages into a single integration. You give it audio in, it gives you audio out. The integration code drops from ~2,000 lines (your own STT/LLM/TTS pipeline) to ~200.
You're paying more per minute for the convenience. For prototyping or low-volume products, the savings on engineering time vastly outweigh the runtime cost difference. A senior engineer's time at $150/hour for a month is $24,000. You'd have to run 480,000 minutes of calls to break even on the cheaper-per-minute self-built version, and most early products don't.
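The break-even arithmetic is worth making explicit. The ~5 cents/minute gap below is implied by the numbers in this paragraph, not a quoted price:

```python
# One engineer-month of build cost versus a per-minute runtime saving.
eng_cost = 150 * 160        # $150/hr × ~160 hrs ≈ $24,000
per_min_saving = 0.05       # implied managed-vs-self-built gap, $/min
break_even_minutes = eng_cost / per_min_saving
# ≈ 480,000 minutes of calls before the self-built pipeline pays off
```

Most early products never cross that line before they've pivoted twice.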
Skip Custom Infrastructure
Don't build a Kubernetes cluster. Don't build CI/CD. Don't write Terraform. A single VM, environment variables, a process supervisor, and you're done. You can move to fancy infrastructure when you have product-market fit and someone is paying you for the calls.
Total Build Cost: ~1-2 Engineering Weeks
You'll have a thing on a phone line that can answer questions in roughly 1 to 2 weeks. It will be slow, it will sometimes drop calls, and it will not handle barge-in well. That's fine for a v0.
Cheapest to Run
Once you have call volume, the per-minute cost matters more than the build cost. The optimization here is different.
Stage 1: Use a Tiered Model Stack
A frontier model on every turn is expensive. A small model is cheap but bad at reasoning. The cheap-and-good answer is two-tier routing:
- A small classifier model decides what kind of turn this is
- Most turns (greetings, confirmations, simple lookups) go to a small fast model
- Complex turns (multi-step reasoning, ambiguous user input) escalate to the frontier model
Done well, this routes 60-80% of turns to the small model and cuts LLM cost by half or more.
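A minimal routing sketch, with a keyword heuristic standing in for the small classifier model (in production the classifier is itself a cheap LLM call, and the patterns below are illustrative, not a recommended rule set):

```python
# Two-tier routing sketch: a cheap check decides whether a turn needs
# the frontier model or can stay on the small fast one.

SIMPLE_PATTERNS = ("hello", "hi ", "thanks", "what time", "are you open")

def route(turn_text: str) -> str:
    """Return which model tier should handle this turn."""
    text = turn_text.lower()
    # Escalate long or clearly complex turns to the frontier model.
    if len(text.split()) > 25 or "why" in text or "compare" in text:
        return "frontier"
    if any(pattern in text for pattern in SIMPLE_PATTERNS):
        return "small"
    # Default cheap; the small model can still request escalation.
    return "small"
```

The design choice that matters: default to the small model and escalate on signal, never the reverse.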
Stage 2: Cache Your System Prompt
If you're not using prompt caching, this is the single biggest free win available. OpenAI and Anthropic both offer it, with cached input pricing roughly 90% lower than uncached. For an agent with a 3,000-token system prompt called 60 times per call, this saves real money.
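A back-of-envelope check on what caching saves, using an illustrative $3/MTok input price and the ~90% cached discount (this ignores the small one-time premium some providers charge to write the cache):

```python
# Savings from prompt caching: a 3,000-token system prompt sent on
# 60 LLM calls per phone call, cached input at ~10% of uncached price.

def prompt_cost(tokens: int, calls: int, price_per_mtok: float,
                cached: bool) -> float:
    """Dollar cost of resending the system prompt across one call."""
    rate = price_per_mtok * (0.1 if cached else 1.0)
    return tokens * calls * rate / 1_000_000

uncached = prompt_cost(3000, 60, 3.00, cached=False)  # $0.54 per call
cached = prompt_cost(3000, 60, 3.00, cached=True)     # $0.054 per call
```

Roughly 49 cents saved per call, on every call, for a one-line API change.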
Stage 3: Aggressively Summarize Conversation History
Don't send the full transcript every turn. After turn 4 or 5, summarize older turns into a 200-token recap. The LLM doesn't need verbatim "user said hello" from 8 turns ago.
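A summarization sketch: keep the last few turns verbatim and collapse everything older into one short recap message. The `summarize()` call here is a placeholder for a cheap small-model request:

```python
# Keep recent turns verbatim; fold older ones into a ~200-token recap.

KEEP_VERBATIM = 4  # last N messages sent as-is

def summarize(messages: list[dict]) -> str:
    """Placeholder: in production, a small-model call returning a recap."""
    return f"Recap of {len(messages)} earlier messages."

def compact_history(history: list[dict]) -> list[dict]:
    """Return a shortened message list to send on the next LLM turn."""
    if len(history) <= KEEP_VERBATIM:
        return history
    older, recent = history[:-KEEP_VERBATIM], history[-KEEP_VERBATIM:]
    recap = {"role": "user", "content": summarize(older)}
    return [recap] + recent
```

Token cost now grows with the recap size, not the call length.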
Stage 4: Cache Tool Results
If your agent looks up business hours, services, prices, FAQ answers, those are static. Cache the tool result for the duration of the call (or longer). You save the tool's compute cost AND the LLM tokens spent re-interpreting the same tool result on subsequent turns.
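A per-call cache sketch, with `fetch_tool()` standing in for your real tool execution:

```python
# Per-call tool-result cache: static lookups (hours, prices, FAQs) are
# fetched once per call and reused on later turns.

def fetch_tool(name: str, args: tuple) -> str:
    """Placeholder for an actual tool / database call."""
    return f"result of {name}{args}"

class CallToolCache:
    """Scoped to one phone call; discard it when the call ends."""

    def __init__(self):
        self._cache: dict = {}
        self.misses = 0  # actual tool executions

    def call(self, name: str, *args) -> str:
        key = (name, args)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = fetch_tool(name, args)
        return self._cache[key]
```

For data that is static across calls too (business hours rarely change mid-day), hoist the cache above the call scope with a TTL.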
Stage 5: Region-Pin Everything
This doesn't save money directly, but it prevents wasted tokens. If your network is slow because providers are in different regions, you'll get more interrupted/cancelled turns, more retries, more "user repeated themselves because the agent was confused" turns. All of those are billable. Faster network → cleaner conversations → lower per-minute cost.
Realistic Cheap Production Cost
With all of the above, a well-tuned voice agent runs about 6-8 cents per minute in component costs. Without any of it, the same agent runs 15-20 cents per minute. The gap is real engineering work, but the math compounds for any high-volume product.
Where Cheap Will Wreck You
Some "cost optimizations" feel smart and quietly destroy the product. Avoid these.
Don't Skimp on TTS
The voice is the product. Callers form their entire opinion of your business in the first 2 seconds of hearing the agent talk. A robotic voice loses leads. Cheap TTS providers exist, but the quality gap is audible and the conversion gap is real. ElevenLabs streaming and equivalents cost more, but they're worth it.
Don't Skimp on STT
If your STT mishears callers, your LLM gives wrong answers, callers repeat themselves, conversations get longer, and costs go up anyway. False economy. Use a quality streaming provider like Deepgram or AssemblyAI.
Don't Skip Recording and Logging
"We don't need recording, that's expensive storage." Six months later you can't debug a customer complaint, you have no data to fine-tune on, and your eval pipeline doesn't exist. Recording and transcript storage are <1 cent per minute on any cloud. Skip it and you're saving pennies while losing months of iteration speed.
Don't Use the Cheapest LLM by Default
A small model is fine for routing. It's not fine as your primary reasoning model unless your domain is genuinely simple. If callers can ask anything that requires multi-step reasoning, you need a real model. Underspending here shows up as conversation quality complaints and low conversion.
Don't Skip Compliance
Skipping STIR/SHAKEN, A2P 10DLC, or relevant data-protection requirements is "free" until you hit a wall. Twilio will quietly de-prioritize unverified traffic. Carriers will block you. GDPR fines exist. Build compliance in early.
A Cheap-But-Not-Stupid Reference Stack
If you want a defensible architecture that's about as cheap as you can get without breaking the product:
| Component | Choice | Approximate Cost / Min |
|---|---|---|
| Telephony | Twilio Voice + Media Streams | $0.013 |
| STT | Streaming provider (Deepgram tier or equivalent) | $0.005 |
| LLM (small, routing) | Smaller model on cheap pricing tier | $0.005 |
| LLM (large, reasoning) | Frontier model with caching, only ~30% of turns | $0.025 |
| TTS | Streaming, fast voice tier | $0.020 |
| Recording / logs | Cloud storage at scale | $0.005 |
| Total | | ~$0.073 |
About 7 cents per minute, all in. That's roughly half of what most teams pay because they didn't tier the model, didn't cache the prompt, and didn't summarize history.
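A quick sanity check that the table's components really sum to the quoted total:

```python
# Per-minute component costs from the reference-stack table above.
components = {
    "telephony": 0.013,
    "stt": 0.005,
    "llm_small": 0.005,
    "llm_large": 0.025,
    "tts": 0.020,
    "recording": 0.005,
}
total = round(sum(components.values()), 3)  # ~7.3 cents per minute
```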
Cheapest to Operate Is Different
The numbers above are runtime cost. They ignore engineering time, debugging time, on-call time, and the cost of every conversation that goes wrong because of a barge-in bug or a reconnect failure.
Realistic operational costs for an in-house voice stack:
- 1-2 engineers spending 30-50% of their time on the agent (latency tuning, prompt iteration, integration fixes)
- One on-call rotation when calls fail (because they will)
- Continuous prompt evaluation and regression testing
For a small team doing 50,000 minutes per month, that's $15,000+ per month in fully-loaded engineering cost on top of the $3,500 in runtime cost. The runtime is the small line item.
The cheapest way to operate is often the most expensive way to run. A managed solution at 12 cents per minute is more expensive than an in-house solution at 7 cents per minute, but if it removes one engineer-month per quarter of work, it's saving you money.
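The same trade-off as arithmetic, using the figures from this section (the engineer-month is valued at the earlier ~$24,000 estimate; these are illustrative numbers, not a pricing quote):

```python
# Build-vs-buy break-even: the managed per-minute premium versus the
# engineering cost it removes.

def managed_premium(minutes_per_month: float,
                    managed_rate: float = 0.12,
                    inhouse_rate: float = 0.07) -> float:
    """Extra runtime cost per month of the managed option, in dollars."""
    return minutes_per_month * (managed_rate - inhouse_rate)

eng_month_cost = 24_000              # fully loaded, from the earlier estimate
saved_per_month = eng_month_cost / 3  # one engineer-month per quarter

premium = managed_premium(50_000)    # ≈ $2,500/month extra runtime
# saved_per_month = $8,000 → the managed option nets ~$5,500/month here
```

The crossover moves with volume: at a few hundred thousand minutes per month the premium starts to dominate, and in-house starts to make sense.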
This Is Exactly Why We Built the OnCallClerk SDK
We've watched teams pick "cheapest to run" and burn 6 months of engineering on optimizations they didn't need yet. We've watched teams pick "cheapest to build" and ship something they couldn't afford to scale. The OnCallClerk SDK is the third option: managed runtime, sane defaults, and you only pay for the conversation, not the orchestration work.
Per-customer prompt caching is automatic. Two-tier model routing is built in. Tool results are cached. Region pinning is handled. You bring your business logic via the API, and we handle the cost optimizations that take engineers months to discover.
Keep Reading
- Why Voice AI Costs More Than Expected - The hidden cost breakdown
- Fastest Way to Launch a Voice AI Product - Time-to-market focus
- How to Reduce Latency in Voice AI Agents - Performance debugging
- How to Make AI Voice Sound Human on Calls - Voice quality
