Voice chatbots broke my brain. Text chatbots? Easy mode - nobody cares if a response takes two seconds. But a 1.7-second pause in a phone conversation? That's your caller hanging up.
Here's what I learned building a multilingual voice AI system from scratch.
The Pipeline That Looks Simple (It's Not)
On paper, a voice chatbot is four steps:
- User speaks → microphone captures audio
- Speech-to-Text (STT) → transcribe what they said
- LLM → generate a response
- Text-to-Speech (TTS) → speak the response back
Every step adds latency. In my measurements, STT takes ~900ms, the LLM ~1,280ms to first token, and TTS needs ~350ms to connect plus ~180ms to generate. Add transport overhead and it's nearly 2.9 seconds before the user hears a syllable.
AGGREGATE TIMING STATISTICS:
STT Latency: avg= 909ms
Claude TTFT: avg= 1278ms
First Audio Latency: avg= 1958ms
User→First Audio: avg= 2889ms
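Summing those stage averages is a useful sanity check on the log (stage names here are my own labels; the ~170ms gap to the measured 2,889ms is presumably transport and scheduling overhead):

```python
# Latency budget built from the measured averages above.
STAGES_MS = {
    "stt": 909,              # speech-to-text finalization
    "llm_ttft": 1278,        # Claude time-to-first-token
    "tts_connect": 350,      # WebSocket handshake to TTS
    "tts_first_audio": 180,  # first synthesized audio frame
}

def budget(stages: dict) -> int:
    """Total serial latency if no stage overlaps another."""
    return sum(stages.values())

print(f"serial budget: {budget(STAGES_MS)}ms")  # → serial budget: 2717ms
```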
Lesson 1: Geography Is a First-Class Design Parameter
I was obsessing over code optimizations - shaving 10ms here, 20ms there. Then I moved my server closer to my API providers and cut latency in half.
Nick Tikhonov's writeup showed this clearly: moving the orchestration layer and using regional endpoints dropped his end-to-end latency from ~1.6s to ~690ms. No code changes. Just geography.
- Co-locate your orchestrator with your heaviest API calls - If your LLM provider has US-East endpoints, your server should be in US-East
- Check regional endpoints for every service - ElevenLabs, Deepgram, your LLM provider all have regional options
- Measure from your deployment region - Local dev benchmarks are meaningless for production latency
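A minimal way to act on that last point: time the call from wherever the code actually runs. This is a generic sketch - `fake_regional_call` is a stand-in for a real HTTPS request to your provider's regional endpoint:

```python
import time

def avg_latency_ms(op, n: int = 5) -> float:
    """Average wall-clock latency of `op` in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        op()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return sum(samples) / n

# Stand-in for a regional API call (e.g. a GET against the
# provider's US-East endpoint); swap in a real request.
fake_regional_call = lambda: time.sleep(0.02)
print(f"{avg_latency_ms(fake_regional_call):.0f}ms")
```

Run the same script from your laptop and from your deployment region and compare - the difference is the part no code optimization can touch.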
Lesson 2: Stream Everything, Buffer Nothing
John Carmack's latency mitigation piece (hosted on Dan Luu's site) drives home that true latency reduction beats buffering every time. In VR, you time-warp rendered frames. In voice AI, you stream your entire pipeline:
- Stream STT - Send partial transcripts to the LLM as they come in
- Stream LLM → TTS - Feed tokens into TTS as they arrive, don't wait for the full response
- Stream TTS → Audio - Send audio frames immediately, don't buffer the whole utterance
# The wrong way: sequential
transcript = await stt.transcribe(audio) # wait...
response = await llm.generate(transcript) # wait...
audio = await tts.synthesize(response) # wait...
play(audio) # finally!
# The right way: streaming pipeline
# (simplified - real impl buffers to sentence boundaries for TTS)
async for partial in stt.stream(audio):
    async for token in llm.stream(partial):
        async for chunk in tts.stream(token):
            play_immediately(chunk)  # 🔥 so much faster
This took my perceived latency from "awkward pause" to "thinking for a moment."
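The sentence-boundary buffering that the snippet above hand-waves can be sketched like this (the regex and flush policy are simplified from what a real pipeline needs):

```python
import re

SENTENCE_END = re.compile(r"([.!?])\s")

def sentence_chunks(tokens):
    """Group streamed LLM tokens into sentence-sized chunks so TTS
    gets natural prosody without waiting for the full response.
    A sketch - a real version also flushes on a max-length cap."""
    buf = ""
    for tok in tokens:
        buf += tok
        while (m := SENTENCE_END.search(buf)):
            yield buf[: m.end(1)].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()

tokens = ["Hello", " there", ". How", " can I", " help you", " today?"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How can I help you today?']
```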
Lesson 3: Pre-Connect Everything
WebSocket handshakes cost 100-300ms each (TLS, protocol upgrades, auth tokens). My approach: start the TTS WebSocket handshake the moment VAD detects end-of-speech, while STT is still finalizing. By the time Claude has its first token, TTS is already connected. My logs show "Using pre-connected TTS WebSocket (saved ~130ms)" on every turn.
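The overlap looks like this in asyncio - `connect_tts` here is a stand-in for the real ElevenLabs handshake, with sleeps faking the measured delays:

```python
import asyncio

async def connect_tts():
    """Stand-in for the real TTS WebSocket handshake
    (TLS + protocol upgrade + auth) - 100-300ms in practice."""
    await asyncio.sleep(0.05)
    return "tts-ws"

async def handle_turn():
    # Fire the handshake the moment VAD signals end-of-speech...
    tts_task = asyncio.create_task(connect_tts())
    # ...so it overlaps with STT finalization instead of following it.
    await asyncio.sleep(0.08)  # stand-in for STT finalizing the transcript
    ws = await tts_task        # usually already resolved by now
    return ws

print(asyncio.run(handle_turn()))  # → tts-ws
```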
Lesson 4: First Token Latency Is Everything
In voice, Time to First Token (TTFT) is the only LLM metric that matters. Your user doesn't care about total generation time - they care about how long the silence lasts.
Claude Sonnet 4.5 averages 1,278ms TTFT in my system, climbing from 1,150ms on Turn 1 to 1,396ms by Turn 3 as conversation history grows. Nick measured Groq at ~80ms versus OpenAI's ~250-300ms - a 3x difference that dominates the pipeline.
I use Claude because multilingual response quality is worth the TTFT tradeoff. But the biggest win was switching from the synchronous SDK to AsyncAnthropic. The sync client blocked my entire event loop, adding 1-2 seconds of artificial latency.
# This was blocking my entire event loop 💀
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(...)
# This let everything run concurrently 🚀
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
async with client.messages.stream(...) as stream:
    async for text in stream.text_stream:
        await tts.send(text)
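If you only log one LLM metric, make it TTFT. A sketch of the probe - `fake_stream` stands in for Claude's `text_stream`, with invented timings:

```python
import asyncio
import time

async def fake_stream():
    """Stand-in for a streamed LLM response; timings are invented."""
    await asyncio.sleep(0.05)   # time to first token
    yield "Hello"
    await asyncio.sleep(0.01)
    yield " world"

async def measure_ttft(stream) -> float:
    """Milliseconds from request start to the first streamed token."""
    t0 = time.perf_counter()
    ttft = None
    async for token in stream:
        if ttft is None:
            ttft = (time.perf_counter() - t0) * 1000
    return ttft

print(f"TTFT: {asyncio.run(measure_ttft(fake_stream())):.0f}ms")
```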
Lesson 5: Interruption Handling Is Harder Than You Think
Without proper handling: the AI keeps talking for 5 seconds after "never mind," background noise triggers false interruptions, and a cough derails everything.
My solution: a multi-stage validation chain.
- VAD detects speech during AI output → start listening
- Early STT check at 0.8s → actual words or just noise?
- If real → cancel LLM generation and TTS
- Final validation at 2.0s with stricter thresholds → commit to interruption
- Cooldown period → prevent re-triggering from residual audio
The 0.8s early detection is the sweet spot. In my last production call: 3 interruption attempts, all 3 validated as real speech, average VAD confidence 0.975, zero false positives. The cooldown period prevents the system from ping-ponging between speaking and listening states.
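A condensed sketch of that validation chain - the 0.8s and 2.0s checkpoints come from the list above, but the confidence thresholds and cooldown length here are illustrative, not my production values:

```python
from dataclasses import dataclass

@dataclass
class InterruptionValidator:
    """Sketch of the multi-stage chain; only the 0.8s/2.0s
    checkpoints are the real values from the text."""
    early_check_s: float = 0.8
    final_check_s: float = 2.0
    cooldown_s: float = 1.0
    _last_interrupt: float = -999.0

    def validate(self, speech_s: float, vad_conf: float,
                 transcript: str, now: float) -> bool:
        if now - self._last_interrupt < self.cooldown_s:
            return False              # still in cooldown: residual audio
        if speech_s < self.early_check_s:
            return False              # too short to judge yet
        if not transcript.strip():
            return False              # VAD fired, but no actual words
        # stricter confidence bar once we reach the final checkpoint
        needed = 0.9 if speech_s >= self.final_check_s else 0.75
        if vad_conf < needed:
            return False
        self._last_interrupt = now
        return True

v = InterruptionValidator()
print(v.validate(1.0, 0.97, "never mind", now=10.0))  # real speech → True
print(v.validate(1.0, 0.97, "wait stop", now=10.5))   # in cooldown → False
```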
Lesson 6: State Machines Save Your Sanity
My first version was a mess of boolean flags: is_speaking, is_listening, is_processing, is_interrupted. Impossible to debug.
I replaced it with six explicit states: LISTENING_IDLE, LISTENING_ACTIVE, PROCESSING, AI_GENERATING, AI_SPEAKING, INTERRUPTED. Each has explicit entry/exit conditions and allowed transitions. No more impossible states.
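A sketch of the same idea - the six states are real, but the transition table here is abbreviated:

```python
from enum import Enum, auto

class State(Enum):
    LISTENING_IDLE = auto()
    LISTENING_ACTIVE = auto()
    PROCESSING = auto()
    AI_GENERATING = auto()
    AI_SPEAKING = auto()
    INTERRUPTED = auto()

# Allowed transitions - abbreviated; a real table has more edges.
TRANSITIONS = {
    State.LISTENING_IDLE: {State.LISTENING_ACTIVE},
    State.LISTENING_ACTIVE: {State.PROCESSING, State.LISTENING_IDLE},
    State.PROCESSING: {State.AI_GENERATING},
    State.AI_GENERATING: {State.AI_SPEAKING, State.INTERRUPTED},
    State.AI_SPEAKING: {State.LISTENING_IDLE, State.INTERRUPTED},
    State.INTERRUPTED: {State.LISTENING_ACTIVE},
}

class ConversationFSM:
    def __init__(self):
        self.state = State.LISTENING_IDLE

    def transition(self, new: State) -> None:
        if new not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal: {self.state.name} → {new.name}")
        self.state = new

fsm = ConversationFSM()
fsm.transition(State.LISTENING_ACTIVE)
fsm.transition(State.PROCESSING)
# fsm.transition(State.AI_SPEAKING) would raise - impossible state caught
```

The payoff is that a bug now shows up as a loud `ValueError` with both state names in it, instead of four booleans silently disagreeing.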
Lesson 7: Multilingual Is a Whole Separate Problem
I needed English, Spanish, French, and Tagalog. "Just change the system prompt language" works for high-resource languages (97-99% of English quality) but not for lower-resource ones like Tagalog.
Research from the Multilingual Prompt Engineering survey (36 papers, 250 languages) and the TALENT framework (Translate After LEarNing Textbook) shaped my approach for Tagalog - three complexity tiers:
- Simple queries → Direct prompting
- Moderate complexity → Translate to English, Chain-of-Thought, translate back (+11-37% improvement)
- Complex reasoning → TALENT-inspired structured reasoning (+14.8% improvement)
Auto-detecting complexity from keywords and word count lets the system pick the right strategy transparently.
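A toy version of that auto-detection - the keyword list and cutoffs are placeholders (the real detector is tuned per language), but the three-tier routing matches the list above:

```python
# Hypothetical reasoning keywords (English + Tagalog examples).
REASONING_KEYWORDS = {"why", "explain", "compare", "bakit", "ipaliwanag"}

def pick_strategy(query: str) -> str:
    """Route a query to a prompting strategy by rough complexity."""
    words = query.lower().split()
    has_reasoning = any(w.strip("?.,!") in REASONING_KEYWORDS for w in words)
    if has_reasoning and len(words) > 12:
        return "talent_structured"  # complex reasoning
    if has_reasoning or len(words) > 12:
        return "translate_cot"      # moderate: EN round-trip + CoT
    return "direct"                 # simple query

print(pick_strategy("Anong oras na?"))              # → direct
print(pick_strategy("Bakit mahal ang singil ko?"))  # → translate_cot
```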
Lesson 8: Don't Build What's Already Built (Unless You Need To)
Pipecat has 40+ AI provider integrations and handles the plumbing you don't want to write. Deepgram Flux combines VAD and transcription into one service. I built custom for control over multilingual routing, interruption thresholds, and RAG integration. But if you're starting out? Use a framework first, then decide if you need to go custom.
The Architecture That Actually Works
Phone Call (Twilio Media Streams WebSocket)
↓
Orchestrator (FastAPI + asyncio)
├─ Audio Input Pipeline
│ ├─ Silero VAD (ONNX-optimized, shared singleton)
│ └─ ElevenLabs Scribe (streaming STT)
├─ Conversation State Machine (6 states)
├─ Interruption Detector (multi-stage validation)
├─ Language Detector + Router
│ └─ Complexity-based prompt selection
├─ Response Generator
│ ├─ Optional RAG (LightRAG)
│ └─ Claude (async streaming)
└─ Audio Output Pipeline
└─ ElevenLabs TTS (streaming WebSocket, pre-connected)
Key design decisions: one module per concern, async from the start, dataclass configs with dependency injection, context variables for logging, and a turn epoch system to prevent stale callbacks from earlier turns.
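The turn epoch is the least obvious item in that list, so here is a minimal sketch with invented names: each new turn bumps a counter, and any callback started under an older epoch drops its result instead of speaking over the new turn.

```python
import asyncio

class TurnEpoch:
    """Each user turn bumps the epoch; work started under an
    earlier epoch sees it's stale and discards its result."""
    def __init__(self):
        self.current = 0

    def begin_turn(self) -> int:
        self.current += 1
        return self.current

    def is_stale(self, epoch: int) -> bool:
        return epoch != self.current

async def demo():
    epochs = TurnEpoch()
    results = []

    async def slow_tts(epoch: int, text: str):
        await asyncio.sleep(0.01)       # stand-in for synthesis
        if epochs.is_stale(epoch):
            return                      # user already moved on
        results.append(text)

    t1 = epochs.begin_turn()
    task = asyncio.create_task(slow_tts(t1, "old answer"))
    epochs.begin_turn()                 # interruption: a new turn starts
    await task
    return results

print(asyncio.run(demo()))  # → [] - the stale callback dropped itself
```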
The Numbers That Matter
Turn 1: stt= 997ms ttft= 1150ms tts_connect= 350ms tts_gen= 174ms user→audio= 2808ms
Turn 2: stt= 850ms ttft= 1287ms tts_connect= 325ms tts_gen= 664ms user→audio= 2832ms
Turn 3: stt= 881ms ttft= 1396ms tts_connect= 378ms tts_gen= 182ms user→audio= 3027ms
~2.9 seconds average. Not good. Barely usable. But every number is a lever:
- STT (~900ms) - Streaming STT or Deepgram Flux could cut this significantly
- Claude TTFT (~1,278ms) - The elephant. A faster model or speculative generation could halve this
- TTS connect (~350ms) - Pre-connection saves ~130ms already; a persistent pool could eliminate it
- TTS generation (~174-664ms) - Varies with how much text Claude has buffered before the first TTS send
I'm not at sub-500ms like Nick achieved with Groq. But I know where every millisecond goes.
Try It Yourself
- Start with Pipecat - Get something working in a day
- Measure latency at every stage - Log timestamps at every pipeline boundary
- Go async from day one - Retrofitting is painful
- Deploy close to your providers - This beats any code optimization
- Build interruption handling early - Without it, people hang up
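For the "measure latency at every stage" point, a context manager makes boundary logging nearly free - a sketch, with sleeps standing in for real stages:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, log: dict):
    """Record wall-clock milliseconds for one pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        log[name] = round((time.perf_counter() - t0) * 1000, 1)

timings = {}
with stage("stt", timings):
    time.sleep(0.02)   # stand-in for transcription
with stage("llm_ttft", timings):
    time.sleep(0.05)   # stand-in for waiting on the first token
print(timings)
```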
Voice AI seems simple until you do it. Every 50ms you shave off, the conversation feels less robotic. I'm at 2.9 seconds. I know how to get to 1.
Further Reading
- How I Built a Sub-500ms Latency Voice Agent from Scratch by Nick Tikhonov - The best practical writeup on voice agent latency, with real measurements from 1.6s to sub-500ms
- Latency Mitigation Strategies by John Carmack (via Dan Luu) - Written about VR, but the "buffer nothing" mindset applies directly to voice pipelines
- Multilingual Prompt Engineering in LLMs: A Survey - Prompting strategies across 250 languages
- Pipecat - Open-source Python framework for voice AI
- Deepgram Flux - Combines turn detection and transcription
Photo by Volodymyr Hryshchenko on Unsplash
Content on this blog was created using human and AI-assisted workflows described here. Original ideas and editorial decisions by Justin Quaintance.