Project: Voice AI

Build a sub-200ms
AI voice agent
from your basement

The complete guide to building a production AI call center using a $2,300 gaming PC from Costco. Handles 25 concurrent calls at 170ms latency — 3.8x cheaper than cloud.

170ms
End-to-end Latency
25
Concurrent Calls
$2,300
Total Hardware Cost
3.8x
Cheaper Than Cloud
MSI Aegis Gaming Desktop at Costco - RTX 5080, AMD Ryzen 9, 32GB RAM
The Story

Why we're building this
from home

We started researching sub-200ms AI voice agents for inbound and outbound phone calls — the kind that can handle real estate leads, divorce intake, appointment booking, and customer support with natural conversation at telephone speed.

The first question was build vs. buy. Cloud GPU servers from providers like RunPod, Lambda, and DigitalOcean run $1,360 to $2,400 per month for a single A100/H100 equivalent — and our workload needs 5 GPUs for 25 concurrent calls. That is $6,800 per month in cloud costs for the same performance a single home machine delivers.

We were originally researching used RTX 3090 builds when we found a smoking deal at Costco: the MSI Aegis Gaming Desktop with an RTX 5080 16GB, AMD Ryzen 9 9900X, 32GB DDR5, and 2TB NVMe — for $2,299.99 ($300 off, member savings valid through March 29, 2026).

The math was not close. A home server on Michigan grid electricity ($0.14/kWh) costs $150/month to operate. Cloud costs $1,350 to $2,550/month for equivalent compute. Cloud is 3.8x more expensive in Year 1, and the gap only widens from there — there is no point at which cloud breaks even.

The Hardware

MSI Aegis ZS2 B9NVV

GPU
NVIDIA GeForce RTX 5080 16GB GDDR7
Enough VRAM for 7B parameter models with room for STT and TTS running simultaneously. Handles 25 concurrent voice sessions.
CPU
AMD Ryzen 9 9900X
12 cores, 24 threads. Handles SIP processing, WebSocket bridges, and orchestration while the GPU runs inference.
Memory
32GB DDR5 (upgradeable to 96GB)
2x DDR5 slots, maximum capacity 96GB. Enough headroom for model loading, audio buffers, and concurrent session state.
Storage
2TB NVMe SSD (3x M.2 slots)
Fast model loading from NVMe. Three M.2 slots for expansion — add dedicated drives for logs, recordings, and model weights.
Power
850W 80 Plus Gold PSU
Peak draw ~650W under full GPU load. At Michigan rates ($0.14/kWh), that is $66/month peak, $41/month average. Our $100/month budget has massive headroom.
Expansion
4x PCIe x16 · 10+ USB Ports
Room to add a second GPU, network cards, or SIP hardware. Full tower form factor with liquid cooling already installed.

Home server vs. cloud
The numbers do not lie

Home Server (MSI Aegis)
$4,250
Year 1 total cost
Hardware (one-time): $2,300
Monthly electricity: $100
Monthly internet: $50
Monthly total: $150
Latency: 170ms
Concurrent calls: 25
Data privacy: 100% local
Vendor lock-in: None
Cloud GPU (equivalent)
$16,200
Year 1 total cost
Hardware (one-time): $0
Monthly compute: $1,350+
Monthly bandwidth: $150+
Monthly total: $1,350–$2,550
Latency: 300–500ms
Concurrent calls: Varies by plan
Data privacy: Multi-tenant
Vendor lock-in: High

Cloud providers use the same Michigan grid electricity — they just mark it up 12x with data center overhead, profit, and multi-tenant management costs.
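The arithmetic behind the comparison fits in a few lines. A minimal sketch using the figures above — the pre-tax sticker price and the low end of the cloud estimate (the $4,250 Year 1 figure above presumably folds in sales tax):

```python
KWH_RATE = 0.14          # Michigan grid electricity, $/kWh
HOURS_PER_MONTH = 24 * 30

def monthly_power_cost(watts: float) -> float:
    """Electricity cost per month at a constant power draw."""
    return watts / 1000 * HOURS_PER_MONTH * KWH_RATE

def year_one(hardware: float, monthly: float) -> float:
    """Total first-year cost: one-time hardware plus 12 months of opex."""
    return hardware + 12 * monthly

peak = monthly_power_cost(650)    # ~$66/month at full GPU load
home = year_one(2300, 150)        # sticker price + power/internet budget
cloud = year_one(0, 1350)         # low end of the cloud estimate
```

Even at the low end of the cloud range, the ratio comes out near 4x in Year 1 — and cloud's monthly bill keeps running while the hardware is already paid for.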

Architecture

How it works

Sub-200ms on phone calls is achievable when you architect for streaming and parallelism across STT, LLM, and TTS — instead of waiting for full utterances at each stage. Here is the call flow:

Voice AI Architecture: Phone Call → SIP Trunk → Asterisk/LiveKit → STT → LLM → TTS → Audio Response
SIP Trunks flowing into Gaming PC with AI brain processing and Text-to-Speech output

Latency budget

To land near a 200ms perceived response time, each stage gets a strict budget:

Stage | Target | How
Network (caller to gateway) | < 50–70ms | SIP trunk, same-region
Gateway to AI server | < 50ms | Localhost (same machine)
STT (streaming partials) | < 100ms | Deepgram / local Whisper
LLM (time to first token) | 200–400ms | 3–7B model, vLLM, quantized
TTS (time to first audio) | 100–300ms | VibeVoice / streaming TTS

The key insight: these stages run in parallel via streaming, not sequentially. STT sends partial transcripts to the LLM before the user finishes speaking. The LLM starts generating tokens on partial input. TTS starts synthesizing audio from the first tokens. The result is overlapping execution that makes the pipeline feel instantaneous.
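The overlap described above can be sketched as three asyncio tasks connected by queues — simulated stages standing in for real STT/LLM/TTS clients:

```python
import asyncio

async def stt(q_text):
    # Simulated streaming STT: emit partial words as they are recognized.
    for w in ["hello", "i", "need", "an", "appointment"]:
        await asyncio.sleep(0.02)
        await q_text.put(w)
    await q_text.put(None)   # end of utterance

async def llm(q_text, q_tok):
    # Start generating on the first partial, not the full utterance.
    while (w := await q_text.get()) is not None:
        await q_tok.put(f"tok({w})")
    await q_tok.put(None)

async def tts(q_tok, audio_out):
    # First audio chunk is produced while STT is still transcribing.
    while (t := await q_tok.get()) is not None:
        audio_out.append(t)

async def run():
    q_text, q_tok, audio = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently; latency overlaps instead of summing.
    await asyncio.gather(stt(q_text), llm(q_text, q_tok), tts(q_tok, audio))
    return audio
```

Because the stages are pipelined, total perceived latency is roughly the slowest stage plus startup overhead, not the sum of all stage budgets.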

Two architecture paths

Option 1: Asterisk + ExternalMedia (fully self-hosted)

Install Asterisk on the server. It handles SIP trunks, inbound DIDs, and outbound dialing. Use ARI (Asterisk REST Interface) + ExternalMedia to create a mixing bridge that pipes RTP audio to your AI process over UDP on localhost. Your Python/Go service decodes PCM, runs it through STT, feeds the LLM, generates TTS audio, and sends PCM back to the bridge. This is the "roll your own Twilio Media Streams" approach — full control, no per-minute fees.
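A sketch of the receiving side of that UDP bridge: parsing the 12-byte RTP header (RFC 3550) to get at the G.711 payload. The field layout is standard; the ExternalMedia specifics (ports, codec negotiation) live in Asterisk configuration and are assumed here, and header extensions are ignored for brevity:

```python
import struct

RTP_HDR = struct.Struct("!BBHII")  # v/p/x/cc, m/pt, sequence, timestamp, SSRC

def parse_rtp(packet: bytes) -> dict:
    """Split a raw RTP packet, as ExternalMedia delivers over UDP,
    into header fields and the codec payload."""
    if len(packet) < RTP_HDR.size:
        raise ValueError("short RTP packet")
    b0, b1, seq, ts, ssrc = RTP_HDR.unpack_from(packet)
    version = b0 >> 6                 # always 2 for RTP
    cc = b0 & 0x0F                    # CSRC count (extra 4-byte IDs)
    payload_type = b1 & 0x7F          # 0 = PCMU (G.711 u-law) on most trunks
    payload = packet[RTP_HDR.size + 4 * cc:]
    return {"version": version, "pt": payload_type, "seq": seq,
            "ts": ts, "ssrc": ssrc, "payload": payload}
```

From there the payload is u-law decoded to PCM and fed to the STT stream; the return path packs synthesized PCM back into RTP the same way.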

Option 2: LiveKit Agents + SIP (modern media stack)

Self-host LiveKit Server on the same machine. It handles WebRTC/SIP sessions and routes low-latency audio to agent processes. LiveKit Agents SDK registers workers that receive audio frames for each call and pipe them through your local STT/LLM/TTS stack. Supports barge-in, turn detection, and multi-participant rooms out of the box.

Best LLMs for voice latency

For real-time voice agents, time-to-first-token (TTFT) is the critical metric — it determines when the response starts streaming to TTS. Target under 300ms for the LLM stage.

3–7B open models (Llama 3.1 8B, Mistral 7B variants) are the sweet spot for self-hosting. They hit 50–400ms TTFT depending on prompt length and quantization. Run them via vLLM or TensorRT-LLM for optimized inference with streaming output.
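TTFT is easy to instrument on any streaming client by wrapping the token iterator — the `fake_llm` generator below is a stand-in for a real vLLM or OpenAI-compatible stream:

```python
import asyncio
import time

async def stream_with_ttft(token_iter):
    """Consume any async token stream and report time-to-first-token
    (the moment TTS could start) alongside the full text."""
    start = time.perf_counter()
    ttft = None
    parts = []
    async for tok in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(tok)
    return ttft, "".join(parts)

async def fake_llm():
    await asyncio.sleep(0.05)        # stand-in for prefill latency
    for tok in ["Sure,", " booking", " now."]:
        yield tok
        await asyncio.sleep(0.01)
```

Logging TTFT per call is how you catch regressions from longer prompts or cold model caches before callers notice them.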

For TTS, Microsoft VibeVoice-Realtime-0.5B is an open-source MIT-licensed model achieving ~300ms latency via streaming generation — a strong option to avoid paid TTS APIs.

Turn detection and barge-in

Phone users talk over each other. You need VAD (Voice Activity Detection) + turn detection to allow interrupting the bot and cutting off TTS playback instantly. LiveKit's framework uses Silero VAD and a multilingual turn detector designed for telephony participants.
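A production stack should use a trained VAD like Silero, but the barge-in control flow itself is simple. A crude energy-threshold sketch with hypothetical threshold values, standing in for a real detector:

```python
import struct

def rms(frame: bytes) -> float:
    """RMS energy of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return (sum(s * s for s in samples) / n) ** 0.5

class BargeInGate:
    """Cut TTS playback once caller speech energy crosses a threshold
    for a few consecutive frames (debouncing out line noise)."""
    def __init__(self, threshold: float = 500.0, frames_needed: int = 3):
        self.threshold = threshold
        self.frames_needed = frames_needed
        self.run = 0
        self.tts_playing = True

    def feed(self, frame: bytes) -> bool:
        # Count consecutive voiced frames; reset on silence.
        self.run = self.run + 1 if rms(frame) > self.threshold else 0
        if self.tts_playing and self.run >= self.frames_needed:
            self.tts_playing = False   # caller barged in: stop playback
        return self.tts_playing
```

The same gate also signals the pipeline to cancel any in-flight LLM generation, so the bot does not finish a reply the caller already talked over.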

AI voice agent server processing 25 concurrent phone calls from a basement
Open Source

Repos you can fork today

These are concrete GitHub projects you can clone and run. Each one handles a different piece of the voice agent stack.

pBread/twilio-openai-minimalist-voicebot
Minimalist voice bot using Twilio Media Streams + WebSocket server + OpenAI Realtime API. Shows the TwiML <Stream/> verb, the WebSocket bridge, and bidirectional audio forwarding.
ericrisco/twilio-realtime-openai-rag
Incoming calls with WebSocket streaming between Twilio and OpenAI Realtime. Includes RAG-style backends, system prompts, and simulated business logic.
prakharbhardwaj/twilio-deepgram-voice-assistant
Twilio calls stream audio via Media Streams to a WebSocket endpoint, which pipes into Deepgram's Voice Agent API and streams responses back.
livekit/agents
Official LiveKit Agents framework. Build AI servers that see, hear, and speak in real time. Supports SIP telephony, plug-in STT/LLM/TTS combos, and worker pool distribution.
happyrobot-ai/livekit-agents
LiveKit Agents fork with Deepgram STT + GPT-4.1 mini + Cartesia TTS. Includes telephony-optimized noise cancellation and multilingual turn detection.
Microsoft VibeVoice-Realtime-0.5B
Open-source MIT-licensed TTS model achieving ~300ms latency via continuous acoustic tokenizer and streaming generation. Replaces paid TTS APIs.
Landscape

Commercial providers
we evaluated

Before building our own, we evaluated the commercial landscape for sub-200ms voice agent platforms. These are the serious players:

Retell AI — Voice-first platform for support/sales agents. Reliable low-latency inbound and outbound calls with good interruption handling.

Telnyx Voice AI — CPaaS + AI stack for real-time agents with full call control. One of the leading voice AI providers for both inbound and outbound automation.

ElevenLabs — High-quality TTS with ~150ms latency for streaming speech. Often used as the TTS component in custom pipelines.

LiveKit + DruidX — Voice agents built on LiveKit with sub-200ms response latency. DruidX emphasizes LiveKit's low-latency infrastructure.

We are building our own because we need 24/7 operation at scale without per-minute costs, full data privacy for client conversations, and the ability to customize every layer of the stack for our specific use cases — real estate lead handling, divorce intake, and appointment scheduling.

Want the results
without the build?

We are offering AI voice agent services for businesses that want sub-200ms phone agents without building the infrastructure themselves. Same hardware, same latency, managed by our team.

Get in Touch