your voice agent stack
five layers that require five different decisions
👋 Hey, I’m Suhas, and welcome to another edition of TPF Weekly!
Last Friday, we co-hosted a mixer for founders, leaders and operators in the Voice AI space.
Many young / 0-1 founders I spoke to asked me some version of: “Should we build our voice AI in-house or use existing infra from ElevenLabs / Vapi / Retell?”
Someone even brought up Apna's case study, which I dug into over the weekend and spent some time understanding.
I liked how the case study framed the choice as build on top of buy. That made me realize most people were treating it as build vs. buy and asking the wrong questions.
The right frame I want to talk about today is which layers of the voice AI stack you own, and which you rent.
A voice agent is five products
A working voice agent is five separate products glued together, each with its own market, pricing model, and build-or-buy answer.
Five layers means five different decisions.
Treating them as one decision is a common and rather expensive mistake I’ve seen some PMs make.
Teams that get this right run a short diagnostic, roughly the one below, before they ever talk to a vendor.
5 questions to ask before meeting the vendor
1. Is voice the product, or a feature inside the product?
If voice is your revenue line, you own more layers. If voice is how you deliver support, sales, or operations, you rent the stack.
Apna runs 1.5 million AI-powered interviews on its jobs platform.
The interview is the product, and while they own how it works, they rent the voice synthesis layer underneath.
For Better.com, the loan is the product, and they rent the entire voice stack.
Neither team built anything from scratch. Instead, they built on top of the existing voice AI infra.
2. Is your first-turn latency budget under 800ms?
Human conversations turn over in about 200-300 milliseconds.
Voice agents feel natural under 500ms, tolerable up to 800ms, and broken above 1.5 seconds.
If you need sub-800ms consistently, you need a platform that owns speech-to-text, text-to-speech, and turn-taking in one stack.
Stitching three vendors together will not get you there, so pick the budget first, then figure out the architecture.
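One way to make the budget concrete is to add up per-stage latencies for a single turn and compare the total against the bands above. This is a minimal sketch: the stage names and millisecond figures are hypothetical, not measurements from any vendor, and the label for the 800ms to 1.5s band is an assumption because the text does not name it.

```python
# Thresholds taken from the text above, in milliseconds.
NATURAL_MS = 500
TOLERABLE_MS = 800
BROKEN_MS = 1500


def classify_latency(total_ms: float) -> str:
    """Map a first-turn latency to the bands described in the text."""
    if total_ms < NATURAL_MS:
        return "natural"
    if total_ms <= TOLERABLE_MS:
        return "tolerable"
    if total_ms <= BROKEN_MS:
        # The text does not name this band; "degraded" is an assumed label.
        return "degraded"
    return "broken"


def total_latency(stages: dict) -> float:
    """Sum the per-stage latencies for one conversational turn."""
    return sum(stages.values())


# Hypothetical per-stage figures, for illustration only.
stitched_vendors = {
    "stt": 300,
    "llm_first_token": 350,
    "tts_first_audio": 250,
    "network_hops": 150,
}
single_stack = {
    "stt": 150,
    "llm_first_token": 300,
    "tts_first_audio": 150,
    "network_hops": 50,
}

for name, stages in [("stitched", stitched_vendors), ("single stack", single_stack)]:
    total = total_latency(stages)
    print(f"{name}: {total}ms -> {classify_latency(total)}")
```

Swap in your own measured numbers per stage. The point of the exercise is to see which stage eats the budget before you commit to an architecture.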
3. Are you running more than 50,000 minutes a month?
Below 50K minutes, every custom build I’ve seen has lost to managed platforms on cost.
Above 50K, custom can save 60-80% on per-minute cost, but only with two engineers committed for 3 months minimum.
Most PMs budget for the initial build cost but underestimate the maintenance burden.
Every quarter, you’ll have to stay on top of model upgrades, prompt drift, regression testing, and dialect coverage.
If you don’t have voice AI talent on the team already, the volume threshold for “build” is closer to 200K minutes.
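The volume math can be sketched as a payback calculation. The 60-80% saving and the two engineers for three months come from the text; the managed per-minute rate, the engineer cost, and the maintenance figure below are hypothetical placeholders that you should replace with your own quotes.

```python
import math


def payback_months(
    minutes_per_month: float,
    managed_rate_per_min: float,            # hypothetical vendor price per minute
    saving_fraction: float,                 # 0.6 to 0.8 per the text
    engineers: int = 2,                     # from the text
    build_months: int = 3,                  # from the text (minimum)
    engineer_monthly_cost: float = 15_000.0,  # hypothetical fully loaded cost
    maintenance_monthly_cost: float = 0.0,    # hypothetical ongoing upkeep
) -> float:
    """Months of savings needed to recover the up-front build cost."""
    build_cost = engineers * build_months * engineer_monthly_cost
    gross_saving = minutes_per_month * managed_rate_per_min * saving_fraction
    net_saving = gross_saving - maintenance_monthly_cost
    if net_saving <= 0:
        # Maintenance eats the whole saving, so the build never pays back.
        return math.inf
    return build_cost / net_saving


for minutes in (50_000, 200_000):
    months = payback_months(minutes, managed_rate_per_min=0.10, saving_fraction=0.6)
    print(f"{minutes} min/month: payback in about {round(months, 1)} months")
```

With these assumed inputs, payback takes roughly 30 months at 50K minutes and roughly 7.5 months at 200K. Setting a non-zero maintenance cost shows how quickly the upkeep described above can wipe out the saving at lower volumes.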
4. Is your compliance posture restricted by HIPAA, PCI, GDPR, or data residency?
Fused speech-to-speech models like OpenAI’s Realtime API sound impressive in demos but have no text layer to audit.
For regulated industries, that’s a non-starter.
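To show what a text layer gives you in practice, here is a minimal sketch of a per-turn audit record in a cascaded pipeline, where the speech-to-text output and the agent's reply both exist as text before any audio is synthesized. The field names and the `record_turn` helper are hypothetical and not part of any vendor's API.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class TurnRecord:
    call_id: str
    turn: int
    user_transcript: str   # text produced by the speech-to-text stage
    agent_text: str        # text the LLM produced, before text-to-speech
    recorded_at: str       # UTC timestamp in ISO 8601 format


def record_turn(log: list, call_id: str, turn: int,
                user_transcript: str, agent_text: str) -> None:
    """Append one auditable JSON line for a single conversational turn."""
    record = TurnRecord(
        call_id=call_id,
        turn=turn,
        user_transcript=user_transcript,
        agent_text=agent_text,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
    log.append(json.dumps(asdict(record)))


# Hypothetical usage with an in-memory log.
audit_log = []
record_turn(audit_log, "call-001", 1, "What is my balance?", "Let me check that for you.")
print(audit_log[0])
```

A fused speech-to-speech model has no equivalent intermediate text to capture at this point in the pipeline, which is the gap an auditor will ask about.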
5. Does your product need code-switching or low-resource languages?
If your users speak Hinglish, Tamil, or switch mid-sentence between English and Hindi, the “best speech-to-text” benchmarks built on US English don’t apply.
Apna’s AI interviewer modulates tone across Indian English, Hindi, and code-mixed speech, with sub-300ms latency.
Most global voice AI vendors do not get close to that across South Asian language combinations.
In India, Southeast Asia, and most of Africa, this question alone decides your vendor.
How the layered call plays out
Here are three companies whose layer decisions led to three different outcomes.
Better.com is the best example of a layered decision done right.
They started with a fused speech-to-speech model, then switched to a cascaded modular stack with ElevenLabs Agents, gaining lower latency and architectural control.
That switch is the layered decision in motion. Notice that the voice plumbing is rented, while the mortgage logic, pricing engine, and 26,000+ product configurations are still theirs.
The result: nearly 100,000 borrower calls handled monthly, 35.5% of inquiries resolved end-to-end, lead-to-lock conversion doubled, and a 41% reduction in the average cost to originate.
Instead of building “voice AI”, they built the layer that was important to their business and bought the layers that weren’t.
Klarna is the cautionary tale
In early 2024, Klarna replaced 700 customer service agents with a wholesale AI deployment and projected $40 million in savings.
Their failure came down to architecture and had hardly anything to do with AI quality.
There was no inspectable layer that decided which calls should be escalated and which the AI should handle. Every call got the same treatment, and the complex disputes broke CSAT.
When Klarna came back with ElevenLabs in February 2026, they kept the voice stack rented but built the escalation logic themselves.
Resolution times dropped 10x for the calls the AI was right to handle.
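An inspectable escalation layer can be as simple as a rule function that returns both a decision and the reason for it. The sketch below is illustrative only: the intent names, the confidence threshold, and the `route` function are hypothetical and are not Klarna's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    intent: str                # e.g. output of an intent classifier
    intent_confidence: float   # classifier confidence between 0.0 and 1.0


@dataclass(frozen=True)
class Decision:
    handler: str   # "ai" or "human"
    reason: str    # human-readable explanation, kept for later review


# Hypothetical policy values; tune these against your own call data.
ESCALATE_INTENTS = {"dispute", "fraud_report"}
MIN_CONFIDENCE = 0.8


def route(call: Call) -> Decision:
    """Decide who handles the call and record why."""
    if call.intent in ESCALATE_INTENTS:
        return Decision("human", f"intent '{call.intent}' is on the escalation list")
    if call.intent_confidence < MIN_CONFIDENCE:
        return Decision("human", "intent confidence below threshold")
    return Decision("ai", "routine intent with sufficient confidence")


for call in (Call("dispute", 0.95), Call("order_status", 0.55), Call("order_status", 0.92)):
    decision = route(call)
    print(f"{call.intent}: {decision.handler} ({decision.reason})")
```

Because every decision carries a reason, you can review the escalation rules when CSAT drops instead of guessing at what the AI did.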
Back to Apna
They bought ElevenLabs’ Text to Speech for the conversational layer, covering Indian English, Hindi, and code-mixed speech from day one, but kept the role × company knowledge graph, persona-specific question generation, and orchestration logic in their own platform.
TLDR: Be like Apna. Use the best of what’s commoditized.
What to do before your next roadmap meeting
Write the five layers on a whiteboard: STT, LLM, TTS, orchestration, telephony.
For each one, write “own or rent” next to it.
Then write the threshold next to your decision: the latency budget, volume number, compliance requirement, and language requirement.
If you can’t fill in the table, you’re not ready to evaluate vendors.
You’re still in the “should we add voice?” phase. The right question teams should be asking in 2026 is:
“What is the layer of this product that only we can build, and what are we paying someone else to handle so we can focus on that?”
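The whiteboard exercise can also be captured as a small table with a completeness check. This is a sketch under assumptions: the `LayerDecision` structure, the `missing_entries` helper, and the example thresholds are hypothetical, while the five layer names come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# The five layers named in the text.
LAYERS = ("STT", "LLM", "TTS", "orchestration", "telephony")


@dataclass
class LayerDecision:
    choice: Optional[str] = None      # "own" or "rent"
    threshold: Optional[str] = None   # latency, volume, compliance, or language


def missing_entries(table: dict) -> list:
    """List what is still blank; an empty list means the table is filled in."""
    gaps = []
    for layer in LAYERS:
        decision = table.get(layer)
        if decision is None or decision.choice not in ("own", "rent"):
            gaps.append(f"{layer}: own or rent?")
        elif not decision.threshold:
            gaps.append(f"{layer}: which threshold drives this?")
    return gaps


# Hypothetical, partially filled table.
table = {
    "STT": LayerDecision("rent", "code-mixed Hindi and English"),
    "LLM": LayerDecision("rent", "text layer must be auditable"),
    "TTS": LayerDecision("rent"),
    "orchestration": LayerDecision("own", "first-turn latency under 800ms"),
}

for gap in missing_entries(table):
    print(gap)
```

In this example the check flags the TTS row for a missing threshold and the telephony row for a missing decision, which is the signal that the team is not yet ready to evaluate vendors.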
Upcoming Events
What pricing pages confess when nobody’s watching
Chennai | May 30
AI Vibe Sprint
Bengaluru | May 30
AI Vibe Sprint
Jakarta | June 13
Exclusive Jobs of the Week
Senior Product Manager
Lead Product Manager
Director Product
These and other roles are open across top companies like Meesho, Google, Zomato & many more.
Reply with the voice AI vendor decision your team is wrestling with right now.
I read every one.
Suhas 👋🏻
P.S. This gets better when the right people are in the room. Share it with one.







