
Cheapest Way to Run a Voice AI Agent

A pragmatic guide to running a voice AI agent on the smallest possible budget. Covers minimum viable architectures, what tradeoffs are acceptable, and where cost-cutting will silently destroy your product.

OnCallClerk Team · April 29, 2026 · 10 min read

What "Cheapest" Actually Means

When developers ask "what's the cheapest way to run a voice AI agent?", they usually mean one of three different things:

  1. Cheapest to build (smallest engineering investment up front)
  2. Cheapest to run (lowest per-minute marginal cost in production)
  3. Cheapest to operate (lowest total cost including humans-in-the-loop, debugging, ongoing iteration)

These three answers are very different, and choosing the wrong one is how teams burn six months building something they can't afford to operate.

This guide walks through what minimum-viable looks like for each, and where cutting cost will quietly break your product.


The Bare-Bones Architecture

If you stripped a voice AI agent down to the absolute minimum components that can hold a real conversation on a phone line, this is what you'd have:

```

Caller → Telephony Provider → Your Server
  → STT (streaming) → Small LLM → TTS (streaming)
  → back through Telephony → Caller

```

That's it. No retrieval. No tools. No memory. No analytics. No barge-in. The thing answers the phone, transcribes the caller, hands the text to a model, speaks the reply, and loops.

You can build this in a weekend. You can run it for under 4 cents per minute. It will also be useless for almost any real product, because:

  • It can't look anything up (no tools, no database access)
  • It can't transfer to a human
  • It can't remember anything across calls
  • It can't be interrupted (talks over callers)
  • It has no failure handling (drops calls when anything glitches)

But it's the floor. Everything past this point is paying for capability or quality.
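The loop above is small enough to sketch in a few lines. Everything here is a stand-in: the `stt`, `llm`, and `tts` functions are placeholders for real streaming provider calls (Deepgram, OpenAI, ElevenLabs, etc.), not any actual SDK.

```python
# Illustrative skeleton of the bare-bones loop. Each function below is
# a stand-in for a real streaming API call; names and signatures are
# hypothetical, chosen only to show the shape of the pipeline.

def stt(audio_chunk: bytes) -> str:
    """Stand-in for a streaming speech-to-text call."""
    return audio_chunk.decode("utf-8")  # pretend the audio is already text

def llm(transcript: str) -> str:
    """Stand-in for a small-model completion call."""
    return f"You said: {transcript}"

def tts(reply: str) -> bytes:
    """Stand-in for a streaming text-to-speech call."""
    return reply.encode("utf-8")

def handle_turn(audio_chunk: bytes) -> bytes:
    """One conversational turn: audio in, audio out. No memory, no tools."""
    transcript = stt(audio_chunk)
    reply = llm(transcript)
    return tts(reply)
```

In a real system each stage streams incrementally rather than waiting for the previous one to finish; that streaming plumbing is most of the 2,000 lines the next section talks about.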


Cheapest to Build

If your goal is "ship something to test the idea, don't optimize for production":

Use a Managed Telephony Voice API

Twilio Voice with Media Streams (or similar) gets you a phone number and bidirectional audio streaming in about 100 lines of code. You don't need to know SIP. You don't need a Kamailio server. You don't need PSTN expertise.

Use the OpenAI Realtime API or Equivalent

Frameworks like the OpenAI Realtime API or LiveKit Agents collapse the STT + LLM + TTS stages into a single API call. You give it audio in, it gives you audio out. The integration code drops from ~2,000 lines (your own STT/LLM/TTS pipeline) to ~200.

You're paying more per minute for the convenience. For prototyping or low-volume products, the savings on engineering time vastly outweigh the runtime cost difference. A senior engineer's time at $150/hour for a month is $24,000. You'd have to run 480,000 minutes of calls to break even on the cheaper-per-minute self-built version, and most early products don't.
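The break-even arithmetic is worth making explicit. The figures below use the article's $24,000 engineer-month plus an illustrative 5-cent-per-minute gap between managed and self-built (the per-minute rates are assumptions, not quotes):

```python
# Break-even math for managed-API vs self-built pipeline.
# The per-minute rates are illustrative assumptions.
ENGINEER_RATE = 150        # $/hour
HOURS_PER_MONTH = 160
build_cost = ENGINEER_RATE * HOURS_PER_MONTH   # one engineer-month: $24,000

managed_per_min = 0.12     # assumed managed-API runtime cost
selfbuilt_per_min = 0.07   # assumed self-built runtime cost
gap = managed_per_min - selfbuilt_per_min      # $0.05 saved per minute

break_even_minutes = build_cost / gap
print(build_cost, round(break_even_minutes))   # 24000 480000
```

At 10,000 call-minutes a month, that break-even point is four years away.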

Skip Custom Infrastructure

Don't build a Kubernetes cluster. Don't build CI/CD. Don't write Terraform. A single VM, environment variables, a process supervisor, and you're done. You can move to fancy infrastructure when you have product-market fit and someone is paying you for the calls.

Total Build Cost: ~1-2 Engineering Weeks

You'll have a thing on a phone line that can answer questions in roughly 1 to 2 weeks. It will be slow, it will sometimes drop calls, and it will not handle barge-in well. That's fine for a v0.


Cheapest to Run

Once you have call volume, the per-minute cost matters more than the build cost. The optimization here is different.

Stage 1: Use a Tiered Model Stack

A frontier model on every turn is expensive. A small model is cheap but bad at reasoning. The cheap-and-good answer is two-tier routing:

  • A small classifier model decides what kind of turn this is
  • Most turns (greetings, confirmations, simple lookups) go to a small fast model
  • Complex turns (multi-step reasoning, ambiguous user input) escalate to the frontier model

Done well, this routes 60-80% of turns to the small model and cuts LLM cost by half or more.
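A minimal sketch of the routing decision. The keyword heuristic here stands in for the real classifier, which in production is itself a small, cheap model call; the tier names are placeholders:

```python
# Two-tier routing sketch. A keyword heuristic stands in for the small
# classifier model described above; tier names are placeholders.
SIMPLE_PATTERNS = ("hello", "thanks", "what time", "hours", "open", "price")

def classify_turn(user_text: str) -> str:
    """Return which model tier should handle this turn."""
    text = user_text.lower()
    if any(p in text for p in SIMPLE_PATTERNS):
        return "small-fast-model"   # greetings, confirmations, simple lookups
    return "frontier-model"         # multi-step reasoning, ambiguous input
```

A real classifier matches on intent rather than substrings, but the control flow is the same: classify first, then pay for the big model only when the turn needs it.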

Stage 2: Cache Your System Prompt

If you're not using prompt caching, this is the single biggest free win available. OpenAI and Anthropic both offer it, with cached input pricing roughly 90% lower than uncached. For an agent with a 3,000-token system prompt called 60 times per call, this saves real money.
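A back-of-envelope for that 3,000-token prompt, using illustrative prices rather than any provider's actual rates:

```python
# Per-call cost of a cached vs uncached system prompt.
# Prices are illustrative assumptions, not any provider's rate card.
PROMPT_TOKENS = 3_000
TURNS_PER_CALL = 60
UNCACHED_PER_MTOK = 3.00                       # $ per million input tokens (assumed)
CACHED_PER_MTOK = UNCACHED_PER_MTOK * 0.10     # ~90% discount when cached

def prompt_cost(per_mtok: float) -> float:
    return PROMPT_TOKENS * TURNS_PER_CALL * per_mtok / 1_000_000

uncached = prompt_cost(UNCACHED_PER_MTOK)   # $0.54 per call
cached = prompt_cost(CACHED_PER_MTOK)       # $0.054 per call
print(round(uncached, 2), round(cached, 3))
```

Roughly 50 cents saved per call, for a one-line change in how you structure the request.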

Stage 3: Aggressively Summarize Conversation History

Don't send the full transcript every turn. After turn 4 or 5, summarize older turns into a 200-token recap. The LLM doesn't need verbatim "user said hello" from 8 turns ago.
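A sketch of the rolling-summary approach, where `summarize` is a stand-in for a cheap LLM call (here it just truncates):

```python
# Rolling-history summarization sketch. summarize() is a stand-in for a
# cheap LLM summarization call; here it crudely joins and truncates.
KEEP_VERBATIM = 4   # most recent turns sent word-for-word

def summarize(turns: list[str]) -> str:
    """Stand-in for an LLM summarization call (~200-token recap)."""
    return ("Earlier: " + " | ".join(turns))[:200]

def build_context(history: list[str]) -> list[str]:
    """Recap older turns, keep the last few verbatim."""
    if len(history) <= KEEP_VERBATIM:
        return list(history)
    recap = summarize(history[:-KEEP_VERBATIM])
    return [recap] + history[-KEEP_VERBATIM:]
```

An 8-turn history collapses to one recap line plus the last 4 turns, and the context stops growing linearly with call length.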

Stage 4: Cache Tool Results

If your agent looks up business hours, services, prices, FAQ answers, those are static. Cache the tool result for the duration of the call (or longer). You save the tool's compute cost AND the LLM tokens spent re-interpreting the same tool result on subsequent turns.
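One way to sketch a call-scoped cache; `fetch_business_hours` is a hypothetical tool, and the cache object lives exactly as long as the call does:

```python
# Call-scoped tool-result cache sketch. fetch_business_hours is a
# hypothetical tool; the cache is created per call and discarded after.

class CallScopedCache:
    """Caches tool results for the duration of one call."""
    def __init__(self):
        self._results: dict[str, object] = {}

    def get_or_fetch(self, key: str, fetch):
        if key not in self._results:
            self._results[key] = fetch()
        return self._results[key]

calls = 0
def fetch_business_hours():
    global calls
    calls += 1                     # count how often the real tool runs
    return "Mon-Fri 9-5"

cache = CallScopedCache()
for _ in range(3):                 # three turns ask the same question
    hours = cache.get_or_fetch("business_hours", fetch_business_hours)
print(hours, calls)                # tool ran once despite three turns
```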

Stage 5: Region-Pin Everything

This doesn't save money directly, but it prevents wasted tokens. If your network is slow because providers are in different regions, you'll get more interrupted/cancelled turns, more retries, more "user repeated themselves because the agent was confused" turns. All of those are billable. Faster network → cleaner conversations → lower per-minute cost.

Realistic Cheap Production Cost

With all of the above, a well-tuned voice agent runs about 6-8 cents per minute in component costs. Without any of it, the same agent runs 15-20 cents per minute. The gap is real engineering work, but the math compounds for any high-volume product.


Where Cheap Will Wreck You

Some "cost optimizations" feel smart and quietly destroy the product. Avoid these.

Don't Skimp on TTS

The voice is the product. Callers form their entire opinion of your business in the first 2 seconds of hearing the agent talk. A robotic voice loses leads. Cheap TTS providers exist, but the quality gap is audible and the conversion gap is real. ElevenLabs streaming and equivalents cost more, but they're worth it.

Don't Skimp on STT

If your STT mishears callers, your LLM gives wrong answers, callers repeat themselves, and conversations get longer, so costs go up anyway. False economy. Use a streaming-quality provider like Deepgram or AssemblyAI.

Don't Skip Recording and Logging

"We don't need recording, that's expensive storage." Six months later you can't debug a customer complaint, you have no data to fine-tune on, and your eval pipeline doesn't exist. Recording and transcript storage are <1 cent per minute on any cloud. Skip it and you're saving pennies while losing months of iteration speed.

Don't Use the Cheapest LLM by Default

A small model is fine for routing. It's not fine as your primary reasoning model unless your domain is genuinely simple. If callers can ask anything that requires multi-step reasoning, you need a real model. Underspending here shows up as conversation quality complaints and low conversion.

Don't Skip Compliance

Skipping STIR/SHAKEN, A2P 10DLC, or relevant data-protection requirements is "free" until you hit a wall. Twilio will quietly de-prioritize unverified traffic. Carriers will block you. GDPR fines exist. Build compliance in early.


A Cheap-But-Not-Stupid Reference Stack

If you want a defensible architecture that's about as cheap as you can get without breaking the product:

| Component | Choice | Approx. Cost / Min |
| --- | --- | --- |
| Telephony | Twilio Voice + Media Streams | $0.013 |
| STT | Streaming provider (Deepgram tier or equivalent) | $0.005 |
| LLM (small, routing) | Smaller model on cheap pricing tier | $0.005 |
| LLM (large, reasoning) | Frontier model with caching, only ~30% of turns | $0.025 |
| TTS | Streaming, fast voice tier | $0.020 |
| Recording / logs | Cloud storage at scale | $0.005 |
| **Total** | | **~$0.073** |

About 7 cents per minute, all in. That's roughly half of what most teams pay because they didn't tier the model, didn't cache the prompt, and didn't summarize history.
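The total is easy to sanity-check by summing the components:

```python
# Sanity-check of the per-minute component costs from the table above.
components = {
    "telephony": 0.013,
    "stt": 0.005,
    "llm_small": 0.005,
    "llm_large": 0.025,   # frontier model, ~30% of turns, cached prompt
    "tts": 0.020,
    "recording": 0.005,
}
total = sum(components.values())
print(round(total, 3))    # ~$0.073 per minute
```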


Cheapest to Operate Is Different

The numbers above are runtime cost. They ignore engineering time, debugging time, on-call time, and the cost of every conversation that goes wrong because of a barge-in bug or a reconnect failure.

Realistic operational costs for an in-house voice stack:

  • 1-2 engineers spending 30-50% of their time on the agent (latency tuning, prompt iteration, integration fixes)
  • One on-call rotation when calls fail (because they will)
  • Continuous prompt evaluation and regression testing

For a small team doing 50,000 minutes per month, that's $15,000+ per month in fully-loaded engineering cost on top of the $3,500 in runtime cost. The runtime is the small line item.

The cheapest way to operate is often the most expensive way to run. A managed solution at 12 cents per minute is more expensive than an in-house solution at 7 cents per minute, but if it removes one engineer-month per quarter of work, it's saving you money.
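Putting the article's monthly figures side by side, and assuming (simplistically) that the managed option absorbs the whole in-house engineering load:

```python
# Monthly operate-vs-run comparison using the article's figures.
# Assumes, simplistically, that managed removes all in-house
# engineering cost; real savings land somewhere in between.
MINUTES = 50_000
inhouse = MINUTES * 0.07 + 15_000    # runtime + fully-loaded engineering
managed = MINUTES * 0.12             # runtime only
print(int(inhouse), int(managed))    # 18500 6000
```

Even if the managed option only eliminates half the engineering cost, it still comes out well ahead at this volume.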


This Is Exactly Why We Built the OnCallClerk SDK

We've watched teams pick "cheapest to run" and burn 6 months of engineering on optimizations they didn't need yet. We've watched teams pick "cheapest to build" and ship something they couldn't afford to scale. The OnCallClerk SDK is the third option: managed runtime, sane defaults, and you only pay for the conversation, not the orchestration work.

Per-customer prompt caching is automatic. Two-tier model routing is built in. Tool results are cached. Region pinning is handled. You bring your business logic via the API, and we handle the cost optimizations that take engineers months to discover.



Tags
cheap voice ai · low cost voice agent · voice ai budget · voice ai architecture · voice ai cost optimization
