Voice chatbots broke my brain. Text chatbots? Easy mode - nobody cares if a response takes two seconds. But a 1.7-second pause in a phone conversation? That's your caller hanging up.
Here's what I learned building a multilingual voice AI system from scratch.
The Pipeline That Looks Simple (It's Not)
On paper, a voice chatbot is four steps:
- User speaks → microphone captures audio
- Speech-to-Text (STT) → transcribe what they said
- LLM → generate a response
- Text-to-Speech (TTS) → speak the response back
Every step adds latency. In my measurements, STT takes ~900ms, the LLM ~1,280ms to first token, and TTS needs ~350ms to connect plus ~180ms to generate. Add transport overhead and it's nearly 2.9 seconds before the user hears a syllable.
AGGREGATE TIMING STATISTICS:
STT Latency: avg= 909ms
Claude TTFT: avg= 1278ms
First Audio Latency: avg= 1958ms
User→First Audio: avg= 2889ms
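Summing those stage averages is a useful sanity check on the log (stage names here are my own labels; the ~170ms gap to the measured 2,889ms is presumably transport and scheduling overhead):

```python
# Latency budget built from the measured averages above.
STAGES_MS = {
    "stt": 909,              # speech-to-text finalization
    "llm_ttft": 1278,        # Claude time-to-first-token
    "tts_connect": 350,      # WebSocket handshake to TTS
    "tts_first_audio": 180,  # first synthesized audio frame
}

def budget(stages: dict) -> int:
    """Total serial latency if no stage overlaps another."""
    return sum(stages.values())

print(f"serial budget: {budget(STAGES_MS)}ms")  # → serial budget: 2717ms
```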
Lesson 1: Geography Is a First-Class Design Parameter
I was obsessing over code optimizations - shaving 10ms here, 20ms there. Then I moved my server closer to my API providers and cut latency in half.
Nick Tikhonov's writeup showed this clearly: moving the orchestration layer and using regional endpoints dropped his end-to-end latency from ~1.6s to ~690ms. No code changes. Just geography.
- Co-locate your orchestrator with your heaviest API calls - If your LLM provider has US-East endpoints, your server should be in US-East
- Check regional endpoints for every service - ElevenLabs, Deepgram, your LLM provider all have regional options
- Measure from your deployment region - Local dev benchmarks are meaningless for production latency
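A minimal way to act on that last point: time the call from wherever the code actually runs. This is a generic sketch - `fake_regional_call` is a stand-in for a real HTTPS request to your provider's regional endpoint:

```python
import time

def avg_latency_ms(op, n: int = 5) -> float:
    """Average wall-clock latency of `op` in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        op()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return sum(samples) / n

# Stand-in for a regional API call (e.g. a GET against the
# provider's US-East endpoint); swap in a real request.
fake_regional_call = lambda: time.sleep(0.02)
print(f"{avg_latency_ms(fake_regional_call):.0f}ms")
```

Run the same script from your laptop and from your deployment region and compare - the difference is the part no code optimization can touch.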
Lesson 2: Stream Everything, Buffer Nothing
John Carmack's latency mitigation piece (hosted on Dan Luu's site) drives home that true latency reduction beats buffering every time. In VR, you time-warp rendered frames. In voice AI, you stream your entire pipeline:
- Stream STT - Send partial transcripts to the LLM as they come in
- Stream LLM → TTS - Feed tokens into TTS as they arrive, don't wait for the full response
- Stream TTS → Audio - Send audio frames immediately, don't buffer the whole utterance
# The wrong way: sequential
transcript = await stt.transcribe(audio) # wait...
response = await llm.generate(transcript) # wait...
audio = await tts.synthesize(response) # wait...
play(audio) # finally!
# The right way: streaming pipeline
# (simplified - real impl buffers to sentence boundaries for TTS)
async for partial in stt.stream(audio):
    async for token in llm.stream(partial):
        async for chunk in tts.stream(token):
            play_immediately(chunk)  # 🔥 so much faster
This took my perceived latency from "awkward pause" to "thinking for a moment."
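The sentence-boundary buffering that the snippet above hand-waves can be sketched like this (the regex and flush policy are simplified from what a real pipeline needs):

```python
import re

SENTENCE_END = re.compile(r"([.!?])\s")

def sentence_chunks(tokens):
    """Group streamed LLM tokens into sentence-sized chunks so TTS
    gets natural prosody without waiting for the full response.
    A sketch - a real version also flushes on a max-length cap."""
    buf = ""
    for tok in tokens:
        buf += tok
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end(1)].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()

tokens = ["Hello", " there", ". How", " can I", " help you", " today?"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How can I help you today?']
```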
Lesson 3: Pre-Connect Everything
WebSocket handshakes cost 100-300ms each (TLS, protocol upgrades, auth tokens). My approach: start the TTS WebSocket handshake the moment VAD detects end-of-speech, while STT is still finalizing. By the time Claude has its first token, TTS is already connected. My logs show "Using pre-connected TTS WebSocket (saved ~130ms)" on every turn.
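The overlap looks like this in asyncio - `connect_tts` here is a stand-in for the real ElevenLabs handshake, with sleeps faking the measured delays:

```python
import asyncio

async def connect_tts():
    """Stand-in for the real TTS WebSocket handshake
    (TLS + protocol upgrade + auth) - 100-300ms in practice."""
    await asyncio.sleep(0.05)
    return "tts-ws"

async def handle_turn():
    # Fire the handshake the moment VAD signals end-of-speech...
    tts_task = asyncio.create_task(connect_tts())
    # ...so it overlaps with STT finalization instead of following it.
    await asyncio.sleep(0.08)  # stand-in for STT finalizing the transcript
    ws = await tts_task        # usually already resolved by now
    return ws

print(asyncio.run(handle_turn()))  # → tts-ws
```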
Lesson 4: First Token Latency Is Everything
In voice, Time to First Token (TTFT) is the only LLM metric that matters. Your user doesn't care about total generation time - they care about how long the silence lasts.
Claude Sonnet 4.5 averages 1,278ms TTFT in my system, climbing from 1,150ms on Turn 1 to 1,396ms by Turn 3 as conversation history grows. Nick measured Groq at ~80ms versus OpenAI's ~250-300ms - a 3x difference that dominates the pipeline.
I use Claude because multilingual response quality is worth the TTFT tradeoff. But the biggest win was switching from the synchronous SDK to AsyncAnthropic. The sync client blocked my entire event loop, adding 1-2 seconds of artificial latency.
# This was blocking my entire event loop 💀
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(...)
# This let everything run concurrently 🚀
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
async with client.messages.stream(...) as stream:
    async for text in stream.text_stream:
        await tts.send(text)
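If you only log one LLM metric, make it TTFT. A sketch of the probe - `fake_stream` stands in for Claude's `text_stream`, with invented timings:

```python
import asyncio
import time

async def fake_stream():
    """Stand-in for a streamed LLM response; timings are invented."""
    await asyncio.sleep(0.05)   # time to first token
    yield "Hello"
    await asyncio.sleep(0.01)
    yield " world"

async def measure_ttft(stream) -> float:
    """Milliseconds from request start to the first streamed token."""
    t0 = time.perf_counter()
    ttft = None
    async for token in stream:
        if ttft is None:
            ttft = (time.perf_counter() - t0) * 1000
    return ttft

print(f"TTFT: {asyncio.run(measure_ttft(fake_stream())):.0f}ms")
```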
Lesson 5: Interruption Handling Is Harder Than You Think
Without proper handling: the AI keeps talking for 5 seconds after "never mind," background noise triggers false interruptions, and a cough derails everything.
My solution: a multi-stage validation chain.
- VAD detects speech during AI output → start listening
- Early STT check at 0.8s → actual words or just noise?
- If real → cancel LLM generation and TTS
- Final validation at 2.0s with stricter thresholds → commit to interruption
- Cooldown period → prevent re-triggering from residual audio
The 0.8s early detection is the sweet spot. In my last production call: 3 interruption attempts, all 3 validated as real speech, average VAD confidence 0.975, zero false positives. The cooldown period prevents the system from ping-ponging between speaking and listening states.
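A condensed sketch of that validation chain - the 0.8s and 2.0s checkpoints come from the list above, but the confidence thresholds and cooldown length here are illustrative, not my production values:

```python
from dataclasses import dataclass

@dataclass
class InterruptionValidator:
    """Sketch of the multi-stage chain; only the 0.8s/2.0s
    checkpoints are the real values from the text."""
    early_check_s: float = 0.8
    final_check_s: float = 2.0
    cooldown_s: float = 1.0
    _last_interrupt: float = -999.0

    def validate(self, speech_s: float, vad_conf: float,
                 transcript: str, now: float) -> bool:
        if now - self._last_interrupt < self.cooldown_s:
            return False              # still in cooldown: residual audio
        if speech_s < self.early_check_s:
            return False              # too short to judge yet
        if not transcript.strip():
            return False              # VAD fired, but no actual words
        # stricter confidence bar once we reach the final checkpoint
        needed = 0.9 if speech_s >= self.final_check_s else 0.75
        if vad_conf < needed:
            return False
        self._last_interrupt = now
        return True

v = InterruptionValidator()
print(v.validate(1.0, 0.97, "never mind", now=10.0))  # real speech → True
print(v.validate(1.0, 0.97, "wait stop", now=10.5))   # in cooldown → False
```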
Lesson 6: State Machines Save Your Sanity
My first version was a mess of boolean flags: is_speaking, is_listening, is_processing, is_interrupted. Impossible to debug.
I replaced it with six explicit states: LISTENING_IDLE, LISTENING_ACTIVE, PROCESSING, AI_GENERATING, AI_SPEAKING, INTERRUPTED. Each has explicit entry/exit conditions and allowed transitions. No more impossible states.
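A sketch of the same idea - the six states are real, but the transition table here is abbreviated:

```python
from enum import Enum, auto

class State(Enum):
    LISTENING_IDLE = auto()
    LISTENING_ACTIVE = auto()
    PROCESSING = auto()
    AI_GENERATING = auto()
    AI_SPEAKING = auto()
    INTERRUPTED = auto()

# Allowed transitions - abbreviated; a real table has more edges.
TRANSITIONS = {
    State.LISTENING_IDLE: {State.LISTENING_ACTIVE},
    State.LISTENING_ACTIVE: {State.PROCESSING, State.LISTENING_IDLE},
    State.PROCESSING: {State.AI_GENERATING},
    State.AI_GENERATING: {State.AI_SPEAKING, State.INTERRUPTED},
    State.AI_SPEAKING: {State.LISTENING_IDLE, State.INTERRUPTED},
    State.INTERRUPTED: {State.LISTENING_ACTIVE},
}

class ConversationFSM:
    def __init__(self):
        self.state = State.LISTENING_IDLE

    def transition(self, new: State) -> None:
        if new not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal: {self.state.name} → {new.name}")
        self.state = new

fsm = ConversationFSM()
fsm.transition(State.LISTENING_ACTIVE)
fsm.transition(State.PROCESSING)
# fsm.transition(State.AI_SPEAKING) would raise - impossible state caught
```

The payoff is that a bug now shows up as a loud `ValueError` with both state names in it, instead of four booleans silently disagreeing.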
Lesson 7: Multilingual Is a Whole Separate Problem
I needed English, Spanish, French, and Tagalog. "Just change the system prompt language" works for high-resource languages (97-99% of English quality) but not for lower-resource ones like Tagalog.
Research from the Multilingual Prompt Engineering survey (36 papers, 250 languages) and the TALENT framework (Translate After LEarNing Textbook) shaped my approach for Tagalog - three complexity tiers:
- Simple queries → Direct prompting
- Moderate complexity → Translate to English, Chain-of-Thought, translate back (+11-37% improvement)
- Complex reasoning → TALENT-inspired structured reasoning (+14.8% improvement)
Auto-detecting complexity from keywords and word count lets the system pick the right strategy transparently.
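A toy version of that auto-detection - the keyword list and cutoffs are placeholders (the real detector is tuned per language), but the three-tier routing matches the list above:

```python
# Hypothetical reasoning keywords (English + Tagalog examples).
REASONING_KEYWORDS = {"why", "explain", "compare", "bakit", "ipaliwanag"}

def pick_strategy(query: str) -> str:
    """Route a query to a prompting strategy by rough complexity."""
    words = query.lower().split()
    has_reasoning = any(w.strip("?.,!") in REASONING_KEYWORDS for w in words)
    if has_reasoning and len(words) > 12:
        return "talent_structured"  # complex reasoning
    if has_reasoning or len(words) > 12:
        return "translate_cot"      # moderate: EN round-trip + CoT
    return "direct"                 # simple query

print(pick_strategy("Anong oras na?"))              # → direct
print(pick_strategy("Bakit mahal ang singil ko?"))  # → translate_cot
```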
Lesson 8: Don't Build What's Already Built (Unless You Need To)
Pipecat has 40+ AI provider integrations and handles the plumbing you don't want to write. Deepgram Flux combines VAD and transcription into one service. I built custom for control over multilingual routing, interruption thresholds, and RAG integration. But if you're starting out? Use a framework first, then decide if you need to go custom.
The Architecture That Actually Works
Phone Call (Twilio Media Streams WebSocket)
↓
Orchestrator (FastAPI + asyncio)
├─ Audio Input Pipeline
│ ├─ Silero VAD (ONNX-optimized, shared singleton)
│ └─ ElevenLabs Scribe (streaming STT)
├─ Conversation State Machine (6 states)
├─ Interruption Detector (multi-stage validation)
├─ Language Detector + Router
│ └─ Complexity-based prompt selection
├─ Response Generator
│ ├─ Optional RAG (LightRAG)
│ └─ Claude (async streaming)
└─ Audio Output Pipeline
└─ ElevenLabs TTS (streaming WebSocket, pre-connected)
Key design decisions: one module per concern, async from the start, dataclass configs with dependency injection, context variables for logging, and a turn epoch system to prevent stale callbacks from earlier turns.
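The turn epoch is the least obvious item in that list, so here is a minimal sketch with invented names: each new turn bumps a counter, and any callback started under an older epoch drops its result instead of speaking over the new turn.

```python
import asyncio

class TurnEpoch:
    """Each user turn bumps the epoch; work started under an
    earlier epoch sees it's stale and discards its result."""
    def __init__(self):
        self.current = 0

    def begin_turn(self) -> int:
        self.current += 1
        return self.current

    def is_stale(self, epoch: int) -> bool:
        return epoch != self.current

async def demo():
    epochs = TurnEpoch()
    results = []

    async def slow_tts(epoch: int, text: str):
        await asyncio.sleep(0.01)       # stand-in for synthesis
        if epochs.is_stale(epoch):
            return                      # user already moved on
        results.append(text)

    t1 = epochs.begin_turn()
    task = asyncio.create_task(slow_tts(t1, "old answer"))
    epochs.begin_turn()                 # interruption: a new turn starts
    await task
    return results

print(asyncio.run(demo()))  # → [] - the stale callback dropped itself
```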
The Numbers That Matter
Turn 1: stt= 997ms ttft= 1150ms tts_connect= 350ms tts_gen= 174ms user→audio= 2808ms
Turn 2: stt= 850ms ttft= 1287ms tts_connect= 325ms tts_gen= 664ms user→audio= 2832ms
Turn 3: stt= 881ms ttft= 1396ms tts_connect= 378ms tts_gen= 182ms user→audio= 3027ms
~2.9 seconds average. Not good. Barely usable. But every number is a lever:
- STT (~900ms) - Streaming STT or Deepgram Flux could cut this significantly
- Claude TTFT (~1,278ms) - The elephant. A faster model or speculative generation could halve this
- TTS connect (~350ms) - Pre-connection saves ~130ms already; a persistent pool could eliminate it
- TTS generation (~174-664ms) - Varies with how much text Claude has buffered before the first TTS send
I'm not at sub-500ms like Nick achieved with Groq. But I know where every millisecond goes.
Try It Yourself
- Start with Pipecat - Get something working in a day
- Measure latency at every stage - Log timestamps at every pipeline boundary
- Go async from day one - Retrofitting is painful
- Deploy close to your providers - This beats any code optimization
- Build interruption handling early - Without it, people hang up
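For the "measure latency at every stage" point, a context manager makes boundary logging nearly free - a sketch, with sleeps standing in for real stages:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, log: dict):
    """Record wall-clock milliseconds for one pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        log[name] = round((time.perf_counter() - t0) * 1000, 1)

timings = {}
with stage("stt", timings):
    time.sleep(0.02)   # stand-in for transcription
with stage("llm_ttft", timings):
    time.sleep(0.05)   # stand-in for waiting on the first token
print(timings)
```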
Voice AI seems simple until you do it. Every 50ms you shave off, the conversation feels less robotic. I'm at 2.9 seconds. I know how to get to 1.
Further Reading
- How I Built a Sub-500ms Latency Voice Agent from Scratch by Nick Tikhonov - The best practical writeup on voice agent latency, with real measurements from 1.6s to sub-500ms
- Latency Mitigation Strategies by John Carmack (via Dan Luu) - Written about VR, but the "buffer nothing" mindset applies directly to voice pipelines
- Multilingual Prompt Engineering in LLMs: A Survey - Prompting strategies across 250 languages
- Pipecat - Open-source Python framework for voice AI
- Deepgram Flux - Combines turn detection and transcription
Photo by Volodymyr Hryshchenko on Unsplash
Content on this blog was created using human and AI-assisted workflows described here. Original ideas and editorial decisions by Justin Quaintance.