Why Speed Matters More Than You Think
In voice AI, the gap between "we want to build this" and "we have a phone number ringing for a real customer" is where most products die. Not because the technology is hard (it is, but it's tractable), but because the orchestration sprawl is unbounded. There's always one more edge case, one more integration, one more failure mode. Six months later, you're still not in production.
Time-to-market matters more in voice AI than in most product categories because:
- The category is moving fast. The model that's best today won't be best in 9 months.
- Customers form opinions of voice AI from one bad call. You need real production data to fix the issues that matter, and you only get that by being live.
- Buyers are evaluating multiple vendors right now. Whoever can demo today wins the meeting; whoever ships in 6 months loses the deal.
This article is about getting a real voice agent in production, taking real calls, in 2 to 4 weeks instead of 4 to 8 months. The shortcuts that work, the shortcuts that don't, and where to draw the line.
The Typical Stack You'll Try First
Most teams start by gluing together best-of-breed components. The architecture looks something like:
```
Twilio (or Telnyx) for telephony
+ Streaming STT provider
+ Frontier LLM
+ Streaming TTS provider
+ Your own orchestration server
+ Your own state machine for barge-in / turn detection / function calls
```
This is also the architecture documented in most tutorials, including Twilio's Media Streams docs and frameworks like LiveKit Agents.
It works. The catch: the "your own orchestration server" line is 80% of the engineering work and 95% of the time you'll spend debugging.
The Realistic Build Timeline
For an in-house voice agent built from components, here's what shipping actually looks like for a competent team. Each row is the wall-clock time, not the engineer-hours.
| Milestone | Realistic Time |
|---|---|
| First prototype that answers the phone and says hello | 2-3 days |
| Hold a basic conversation with an LLM in the loop | 1 week |
| Real STT + TTS streaming, not robotic | 1-2 weeks |
| Tool calling works and looks up your data | 2-3 weeks |
| Barge-in handling that doesn't talk over callers | 3-5 weeks |
| Reconnect logic that survives WebSocket drops | 4-6 weeks |
| Filler audio during tool calls | 5-7 weeks |
| Acceptable latency on a real phone (under 1 second) | 6-10 weeks |
| First real production customer with the agent live | 8-14 weeks |
| Voice quality consistent enough for a B2B sales demo | 12-20 weeks |
| Cost economics that work at scale | 20-32 weeks |
Six to eight months is the typical range to a defensible product, and that assumes nothing goes catastrophically wrong with telephony provisioning, model regressions, or audio codec bugs.
The Two-Week Path to Production
If your goal is "live phone number, real callers, useful conversations" in 2 weeks, this is the realistic plan.
Week 1: Pick a Managed Voice Layer
Don't write your own orchestration. The OpenAI Realtime API gives you audio-in, audio-out with built-in turn detection. LiveKit Agents gives you a framework. Managed APIs that abstract telephony plus voice AI cut out 80% of the work that doesn't differentiate your product.
The build-from-scratch path saves you maybe 3-5 cents per minute at scale. The managed path saves you 4-6 months of engineering. Pick whichever your business actually needs.
Week 1: Pick a Voice Provider You Don't Hate
The voice is the product. You can A/B voices later, but pick something that doesn't sound robotic on your first launch. ElevenLabs streaming is the safest choice and integrates with most managed voice frameworks.
Week 1: Define the Three Things Your Agent Does
Don't try to build a general-purpose receptionist. Pick three specific intents. "Book an appointment", "answer pricing questions", "transfer to a human". Anything beyond those three should fall through gracefully to "let me have someone call you back".
A focused agent that handles three things well will ship in a week. A general agent that handles "anything a caller might say" won't ship in a year.
Week 2: Wire Up One Tool
If your agent only needs to do one thing well, wire up one tool. Calendar booking, CRM lookup, support ticket creation. Anything more is scope creep.
This is the integration that matters: the LLM emits a function call, your code executes it, the result goes back to the LLM, the agent speaks the answer. Get this loop tight for ONE tool and you've shipped a useful product.
Week 2: Get a Phone Number, Run 50 Test Calls
Buy a number. Hit it from your phone. Hit it from another phone. Hit it from a phone with a kid screaming in the background. Try to confuse the agent. Hang up mid-sentence. Talk over it. Take notes.
You'll find 20 bugs. Fix the 5 worst ones. Ship.
End of Week 2: Live with Real Callers
Send the number to one real customer. Watch their first 10 calls go through your transcripts. You'll see exactly what doesn't work, and you'll fix it on real signal instead of imagined edge cases.
The Shortcuts That Work
These are the shortcuts that genuinely save time without breaking the product.
Skip Custom Voice Selection
Pick one good voice. Don't build a voice picker. Don't fine-tune anything. Don't experiment with cloning. Voice picking is a v2 feature.
Skip Multi-Tenancy
If you have one customer, don't build a multi-tenant config system. Hard-code their config. Refactor when you have customer #2.
Skip Custom Knowledge Base
If your agent needs to know the customer's pricing, hard-code it in the system prompt. Skip the vector database. Skip the RAG pipeline. You can add retrieval when "I'm not sure about that" becomes the actual bottleneck for the customer.
Skip Voicemail Detection (At First)
Let the agent talk to voicemail boxes. It looks dumb. Your customer cares about real callers, not voicemail boxes. Add detection in v2.
Skip Custom Telephony
Don't deploy SIP servers. Don't run Asterisk. Don't terminate PSTN yourself. The cost difference is real but only at very large scale, and you'll be there in v3.
Skip Most Analytics
Log calls and transcripts. That's it. Don't build a dashboard. Don't build sentiment analysis. Don't build cohort retention reports. You can read the transcripts in your head for the first 1,000 calls.
The Shortcuts That Will Wreck You
These are the shortcuts that look like time-savers and turn into nightmares.
Don't Skip Recording
You will need to debug a real customer complaint. You will need to retrain on real data. You will need to defend yourself in a "the agent said something it didn't say" dispute. Recording is the cheapest insurance available.
Don't Skip Compliance Checkboxes
If you're in the US, A2P 10DLC and STIR/SHAKEN aren't optional. Carriers will throttle you. You don't have to do them perfectly, but you have to start them in week 1, because approval takes time.
Don't Skip Barge-in Handling
If your agent talks over callers, callers will hang up. This is the single fastest way to lose every demo and every prospect. Even a crappy barge-in handler is better than none.
Don't Skip Latency Testing on Real Phones
Your laptop is on WiFi with a low-latency mic. Real callers are on 4G. The latency difference is 200-400ms. Test on a real phone or your beta will fail.
Don't Skip Reconnect Logic
WebSockets drop. Yours will drop in production. If you don't handle it, the call fails silently. Build at least basic reconnect logic, even if it's "rebuild the conversation state from the LLM message history".
Build vs Buy: A Decision Framework
The honest decision tree:
Use a fully-managed voice agent service if:
- You don't have voice AI engineers
- Your differentiation is your data or your workflow, not the voice tech
- You need to ship in under 4 weeks
- Per-call volume is under 200,000 minutes/month (where managed economics still work)
Use a managed voice layer (OpenAI Realtime, LiveKit, etc.) if:
- You have engineering capacity but not voice-specific expertise
- You want control over prompts, tools, and conversation flow
- You're willing to maintain orchestration code
- You're comfortable being a level closer to the metal
Build it from scratch with raw STT + LLM + TTS if:
- Voice AI IS your product (you're a voice AI vendor)
- You have specific latency or cost targets that managed offerings can't hit
- You have or are hiring senior engineers with telephony experience
- You're prepared for a 6-month timeline
For 90% of products, the second or first options are the correct answer. Building from scratch is romantic, expensive, and rarely the right call.
This Is Exactly Why We Built the OnCallClerk SDK
The OnCallClerk SDK is the "buy and skip the build" option for teams that want to ship a voice AI product in 2 weeks instead of 6 months. You write your business logic. You define your tools. You describe your agent's behavior. We give you back a phone number and an agent that has all the orchestration, latency tuning, barge-in handling, reconnect logic, and cost optimizations you'd otherwise spend a year building.
The API reference shows the surface area: a few endpoints to configure your agent, a webhook for tool calls, and you're shipping. The SDK drops in alongside whatever else you're building. Most teams have a working agent on day one and a customer on it within a week.
If your competitor is going to ship in 4 weeks, you don't have 6 months.
Keep Reading
- How to Build a Low Latency AI Phone Agent - Architecture from scratch
- Why Voice AI Costs More Than Expected - The honest cost breakdown
- Cheapest Way to Run a Voice AI Agent - Minimum viable spend
- How to Make AI Voice Sound Human on Calls - Voice quality
