Why We Almost Gave Up on Voice AI
In early 2024, a client asked us to build a voice-based customer support agent for their insurance claims process. We thought: how hard can it be? Take an STT model, pipe it to an LLM, pipe the response to TTS. Done. Right? We were so wrong it's almost funny in retrospect. Almost.
Six months, three architecture rewrites, and a lot of humble pie later, we shipped something that actually works. Here's everything we learned.
Latency Is the Entire Game
In a text-based chatbot, 2-3 seconds of response time is fine. In a voice conversation, anything over 800ms feels broken. Users hang up. We measured this: the dropout rate at 500ms of latency was 8%; at 1.5 seconds it jumped to 41%. So the entire architecture had to be optimized for latency above all else.
Our first approach — Whisper for STT, GPT-4 for reasoning, ElevenLabs for TTS — had an average round-trip time of 4.2 seconds. Completely unusable. We switched to Deepgram for streaming STT (they process audio chunks as they arrive, not waiting for silence), GPT-4o-mini for reasoning (faster than GPT-4, good enough for our use case), and Deepgram's TTS for output (lower quality than ElevenLabs, but 3x faster). End-to-end latency: 1.1 seconds. Still not great, but usable.
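To make the data flow concrete, here's a minimal sketch of the streaming shape of the pipeline: each stage hands partial results downstream instead of waiting for the previous stage to finish. The coroutine names (`stream_transcripts`, `llm_reply`, `synthesize_and_play`) are stand-ins we've invented for illustration, not Deepgram or OpenAI API calls; in a real system each would wrap the vendor SDK.

```python
import asyncio

async def fake_microphone():
    """Simulated caller audio, one 'chunk' per word (stand-in for a telephony stream)."""
    for word in "my claim was rejected".split():
        await asyncio.sleep(0.05)
        yield word

async def stream_transcripts(audio_chunks):
    """Yield interim transcripts as audio chunks arrive (stand-in for streaming STT)."""
    heard = []
    async for chunk in audio_chunks:
        heard.append(chunk)
        yield " ".join(heard)  # interim hypothesis grows as audio arrives

async def llm_reply(utterance: str) -> str:
    """Stand-in for the reasoning model call (a small, fast chat model)."""
    await asyncio.sleep(0.2)  # simulated model latency
    return f"Sure, let me look into: {utterance}"

async def synthesize_and_play(text: str) -> None:
    """Stand-in for TTS synthesis and audio playback."""
    await asyncio.sleep(0.1)
    print(f"[TTS] {text}")

async def handle_turn() -> None:
    final_transcript = ""
    # Consume interim transcripts; the last one seen when the caller stops
    # talking becomes the turn's final transcript.
    async for interim in stream_transcripts(fake_microphone()):
        final_transcript = interim
    reply = await llm_reply(final_transcript)
    await synthesize_and_play(reply)

asyncio.run(handle_turn())
```

The point of the streaming structure is that STT work overlaps with the caller still speaking, so the LLM call starts almost immediately after they stop; that overlap is where most of the 4.2-to-1.1-second improvement came from.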
The Interruption Problem
Real conversations aren't turn-based. People interrupt, they say "uh-huh" while you're talking, they start speaking before you finish. Our first version would just keep talking even when the user tried to interrupt, which made it feel robotic and infuriating. We implemented voice activity detection (VAD) using Silero VAD and built an interruption handler that could stop TTS playback mid-sentence and restart the conversation flow. This took three weeks to get right. The edge cases are brutal — background noise, the user coughing, their kid yelling in the background.
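The core of the barge-in check is small, even if the surrounding plumbing isn't. Here's a simplified sketch using Silero VAD; the thresholds, frame counts, and the way microphone chunks are captured while TTS is playing are assumptions you'd tune for your own audio stack, not our production values.

```python
import torch

# Load Silero VAD from torch.hub (the documented way to get the model).
model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 512          # Silero VAD expects 512-sample chunks at 16 kHz
SPEECH_THRESHOLD = 0.6       # illustrative; set high enough to ignore coughs and background noise
CONSECUTIVE_FRAMES = 3       # require ~100 ms of sustained speech before interrupting

def should_interrupt(mic_chunks) -> bool:
    """Return True once the caller has clearly started talking over the agent.

    `mic_chunks` is an iterable of float32 tensors of CHUNK_SAMPLES samples,
    captured while TTS playback is running; capturing them (PyAudio,
    sounddevice, a telephony stream) is left to your audio stack.
    """
    streak = 0
    for chunk in mic_chunks:
        speech_prob = model(chunk, SAMPLE_RATE).item()
        streak = streak + 1 if speech_prob > SPEECH_THRESHOLD else 0
        if streak >= CONSECUTIVE_FRAMES:
            return True  # caller is speaking: stop TTS playback and go back to listening
    return False
```

Requiring several consecutive speech frames, rather than reacting to a single one, is what keeps a cough or a kid yelling in the background from cutting the agent off mid-sentence.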
Indian English Is a Special Challenge
Our client's users are primarily in India, and the STT accuracy for Indian English accents was significantly lower than for American English. Whisper was the worst offender — about 15% word error rate on our test set of Indian English speakers vs 5% for American English. Deepgram was better at around 9%. We ended up building a post-processing layer that corrects common misrecognitions specific to Indian English ("lakh" misrecognized as "luck," "crore" as "core," etc.). This brought effective accuracy to about 94%, which was acceptable.
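A deliberately simplified version of that post-processing layer is below. The real one is more context-aware; this sketch only rewrites known confusions when they appear next to a number, which is where "lakh" and "crore" actually show up in claims conversations. The word list is illustrative, not exhaustive.

```python
import re

# Known STT confusions for Indian English in our domain (illustrative subset).
CORRECTIONS = {
    "luck": "lakh",    # "5 luck rupees"  -> "5 lakh rupees"
    "lacks": "lakhs",
    "core": "crore",   # "2 core"         -> "2 crore"
    "cores": "crores",
}

# Only fix these words when they directly follow a number.
_NUMBER_THEN_WORD = re.compile(
    r"(\d[\d,.]*)\s+(" + "|".join(CORRECTIONS) + r")\b", re.IGNORECASE
)

def correct_transcript(text: str) -> str:
    def fix(match: re.Match) -> str:
        number, word = match.group(1), match.group(2)
        return f"{number} {CORRECTIONS[word.lower()]}"
    return _NUMBER_THEN_WORD.sub(fix, text)

print(correct_transcript("I paid a premium of 2 luck rupees"))
# -> "I paid a premium of 2 lakh rupees"
```

Anchoring the replacement to a preceding number is the cheap way to avoid rewriting "core" or "luck" when the caller genuinely means them.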
The Conversation Design Nobody Talks About
The LLM doesn't know how to have a natural phone conversation unless you teach it. It'll give long, rambling answers when a short confirmation is needed. It won't ask clarifying questions when the user's intent is ambiguous. It doesn't know when to escalate to a human. We spent more time on conversation design — writing detailed system prompts, building decision trees for escalation, creating fallback responses — than on any other part of the system. This is the unglamorous work that makes or breaks a voice agent.
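To give a flavor of what that scaffolding looks like, here's a sketch of a phone-oriented system prompt plus a deterministic escalation check that runs outside the LLM. The prompt wording, trigger phrases, and thresholds are examples, not our production values.

```python
from dataclasses import dataclass, field

SYSTEM_PROMPT = """You are a phone support agent for insurance claims.
Keep every answer under two sentences. If the caller's request is ambiguous,
ask one clarifying question instead of guessing. Never invent claim details."""

ESCALATION_PHRASES = ("speak to a human", "agent please", "talk to a person")
MAX_FAILED_TURNS = 2  # consecutive turns where the model couldn't resolve the caller's intent

@dataclass
class CallState:
    failed_turns: int = 0
    transcript: list[str] = field(default_factory=list)

def should_escalate(state: CallState, latest_user_utterance: str) -> bool:
    """Deterministic escalation rules, evaluated before the LLM sees the turn."""
    utterance = latest_user_utterance.lower()
    if any(phrase in utterance for phrase in ESCALATION_PHRASES):
        return True  # an explicit request for a human always wins
    return state.failed_turns >= MAX_FAILED_TURNS

state = CallState(failed_turns=2)
print(should_escalate(state, "this is not working"))  # -> True
```

Keeping escalation rules as plain code rather than asking the model to decide makes the behavior auditable, which matters when the fallback path is a human agent with a per-call cost.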
Where We Are Now
The system handles about 2,000 calls per day with a 73% resolution rate (meaning the AI resolves the issue without needing a human). The client's cost per resolved call dropped from ₹85 to ₹12. It's not perfect — callers in noisy environments still struggle, and complex multi-step claims still need humans. But it's a real product serving real users, and it took us six months of failing to get there.