Project: Voice AI

Build a sub-200ms
AI voice agent
from your basement

The complete guide to building a production AI call center using a $2,300 gaming PC from Costco. Handles 25 concurrent calls at 170ms latency — 3.8x cheaper than cloud.

170ms
End-to-end Latency
25
Concurrent Calls
$2,300
Total Hardware Cost
3.8x
Cheaper Than Cloud
MSI Aegis Gaming Desktop at Costco - RTX 5080, AMD Ryzen 9, 32GB RAM
The Story

Why we're building this
from home

We started researching sub-200ms AI voice agents for inbound and outbound phone calls — the kind that can handle real estate leads, divorce intake, appointment booking, and customer support with natural conversation at telephone speed.

The first question was build vs. buy. Cloud GPU servers from providers like RunPod, Lambda, and DigitalOcean run $1,360 to $2,400 per month for a single A100/H100 equivalent — and our workload needs 5 GPUs for 25 concurrent calls. That is $6,800 per month in cloud costs for the same performance a single home machine delivers.

We were originally researching used RTX 3090 builds when we found a smoking deal at Costco: the MSI Aegis Gaming Desktop with an RTX 5080 16GB, AMD Ryzen 9 9900X, 32GB DDR5, and 2TB NVMe — for $2,299.99 ($300 off, member savings valid through March 29, 2026).

The math was not close. A home server on Michigan grid electricity ($0.14/kWh) costs $150/month to operate. Cloud costs $1,350 to $2,550/month for equivalent compute. Cloud is 3.8x more expensive in Year 1, and the gap only widens from there — there is no point at which cloud breaks even.

The Hardware

MSI Aegis ZS2 B9NVV

GPU
NVIDIA GeForce RTX 5080 16GB GDDR7
Enough VRAM for 7B parameter models with room for STT and TTS running simultaneously. Handles 25 concurrent voice sessions.
CPU
AMD Ryzen 9 9900X
12 cores, 24 threads. Handles SIP processing, WebSocket bridges, and orchestration while the GPU runs inference.
Memory
32GB DDR5 (upgradeable to 96GB)
2x DDR5 slots, maximum capacity 96GB. Enough headroom for model loading, audio buffers, and concurrent session state.
Storage
2TB NVMe SSD (3x M.2 slots)
Fast model loading from NVMe. Three M.2 slots for expansion — add dedicated drives for logs, recordings, and model weights.
Power
850W 80 Plus Gold PSU
Peak draw ~650W under full GPU load. At Michigan rates ($0.14/kWh), that is $66/month peak, $41/month average. Our $100/month budget has massive headroom.
Expansion
4x PCIe x16 · 10+ USB Ports
Room to add a second GPU, network cards, or SIP hardware. Full tower form factor with liquid cooling already installed.

Home server vs. cloud
The numbers do not lie

Home Server (MSI Aegis)
$4,250
Year 1 total cost
Hardware (one-time): $2,300
Monthly electricity: $100
Monthly internet: $50
Monthly total: $150
Latency: 170ms
Concurrent calls: 25
Data privacy: 100% local
Vendor lock-in: None
Cloud GPU (equivalent)
$16,200
Year 1 total cost
Hardware (one-time): $0
Monthly compute: $1,350+
Monthly bandwidth: $150+
Monthly total: $1,350–$2,550
Latency: 300–500ms
Concurrent calls: Varies by plan
Data privacy: Multi-tenant
Vendor lock-in: High

Cloud providers use the same Michigan grid electricity — they just mark it up 12x with data center overhead, profit, and multi-tenant management costs.
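The arithmetic behind the comparison fits in a few lines. A minimal sketch using the figures above — the pre-tax sticker price and the low end of the cloud estimate (the $4,250 Year 1 figure above presumably folds in sales tax):

```python
KWH_RATE = 0.14          # Michigan grid electricity, $/kWh
HOURS_PER_MONTH = 24 * 30

def monthly_power_cost(watts: float) -> float:
    """Electricity cost per month at a constant power draw."""
    return watts / 1000 * HOURS_PER_MONTH * KWH_RATE

def year_one(hardware: float, monthly: float) -> float:
    """Total first-year cost: one-time hardware plus 12 months of opex."""
    return hardware + 12 * monthly

peak = monthly_power_cost(650)    # ~$66/month at full GPU load
home = year_one(2300, 150)        # sticker price + power/internet budget
cloud = year_one(0, 1350)         # low end of the cloud estimate
```

Even at the low end of the cloud range, the ratio comes out near 4x in Year 1 — and cloud's monthly bill keeps running while the hardware is already paid for.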

Architecture

How it works

Sub-200ms on phone calls is achievable when you architect for streaming and parallelism across STT, LLM, and TTS — instead of waiting for full utterances at each stage. Here is the call flow:

Voice AI Architecture: Phone Call → SIP Trunk → Asterisk/LiveKit → STT → LLM → TTS → Audio Response
SIP Trunks flowing into Gaming PC with AI brain processing and Text-to-Speech output

Latency budget

To land near a 200ms perceived response time, each stage gets a strict budget:

Stage | Target | How
Network (caller to gateway) | < 50–70ms | SIP trunk, same-region
Gateway to AI server | < 50ms | Localhost (same machine)
STT (streaming partials) | < 100ms | Deepgram / local Whisper
LLM (time to first token) | 200–400ms | 3–7B model, vLLM, quantized
TTS (time to first audio) | 100–300ms | VibeVoice / streaming TTS

The key insight: these stages run in parallel via streaming, not sequentially. STT sends partial transcripts to the LLM before the user finishes speaking. The LLM starts generating tokens on partial input. TTS starts synthesizing audio from the first tokens. The result is overlapping execution that makes the pipeline feel instantaneous.
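The overlap described above can be sketched as three asyncio tasks connected by queues — simulated stages standing in for real STT/LLM/TTS clients:

```python
import asyncio

async def stt(q_text):
    # Simulated streaming STT: emit partial words as they are recognized.
    for w in ["hello", "i", "need", "an", "appointment"]:
        await asyncio.sleep(0.02)
        await q_text.put(w)
    await q_text.put(None)   # end of utterance

async def llm(q_text, q_tok):
    # Start generating on the first partial, not the full utterance.
    while (w := await q_text.get()) is not None:
        await q_tok.put(f"tok({w})")
    await q_tok.put(None)

async def tts(q_tok, audio_out):
    # First audio chunk is produced while STT is still transcribing.
    while (t := await q_tok.get()) is not None:
        audio_out.append(t)

async def run():
    q_text, q_tok, audio = asyncio.Queue(), asyncio.Queue(), []
    # All three stages run concurrently; latency overlaps instead of summing.
    await asyncio.gather(stt(q_text), llm(q_text, q_tok), tts(q_tok, audio))
    return audio
```

Because the stages are pipelined, total perceived latency is roughly the slowest stage plus startup overhead, not the sum of all stage budgets.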

Two architecture paths

Option 1: Asterisk + ExternalMedia (fully self-hosted)

Install Asterisk on the server. It handles SIP trunks, inbound DIDs, and outbound dialing. Use ARI (Asterisk REST Interface) + ExternalMedia to create a mixing bridge that pipes RTP audio to your AI process over UDP on localhost. Your Python/Go service decodes PCM, runs it through STT, feeds the LLM, generates TTS audio, and sends PCM back to the bridge. This is the "roll your own Twilio Media Streams" approach — full control, no per-minute fees.
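A sketch of the receiving side of that UDP bridge: parsing the 12-byte RTP header (RFC 3550) to get at the G.711 payload. The field layout is standard; the ExternalMedia specifics (ports, codec negotiation) live in Asterisk configuration and are assumed here, and header extensions are ignored for brevity:

```python
import struct

RTP_HDR = struct.Struct("!BBHII")  # v/p/x/cc, m/pt, sequence, timestamp, SSRC

def parse_rtp(packet: bytes) -> dict:
    """Split a raw RTP packet, as ExternalMedia delivers over UDP,
    into header fields and the codec payload."""
    if len(packet) < RTP_HDR.size:
        raise ValueError("short RTP packet")
    b0, b1, seq, ts, ssrc = RTP_HDR.unpack_from(packet)
    version = b0 >> 6                 # always 2 for RTP
    cc = b0 & 0x0F                    # CSRC count (extra 4-byte IDs)
    payload_type = b1 & 0x7F          # 0 = PCMU (G.711 u-law) on most trunks
    payload = packet[RTP_HDR.size + 4 * cc:]
    return {"version": version, "pt": payload_type, "seq": seq,
            "ts": ts, "ssrc": ssrc, "payload": payload}
```

From there the payload is u-law decoded to PCM and fed to the STT stream; the return path packs synthesized PCM back into RTP the same way.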

Option 2: LiveKit Agents + SIP (modern media stack)

Self-host LiveKit Server on the same machine. It handles WebRTC/SIP sessions and routes low-latency audio to agent processes. LiveKit Agents SDK registers workers that receive audio frames for each call and pipe them through your local STT/LLM/TTS stack. Supports barge-in, turn detection, and multi-participant rooms out of the box.

Best LLMs for voice latency

For real-time voice agents, time-to-first-token (TTFT) is the critical metric — it determines when the response starts streaming to TTS. Target under 300ms for the LLM stage.

3–7B open models (Llama 3.1 8B, Mistral 7B variants) are the sweet spot for self-hosting. They hit 50–400ms TTFT depending on prompt length and quantization. Run them via vLLM or TensorRT-LLM for optimized inference with streaming output.
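TTFT is easy to instrument on any streaming client by wrapping the token iterator — the `fake_llm` generator below is a stand-in for a real vLLM or OpenAI-compatible stream:

```python
import asyncio
import time

async def stream_with_ttft(token_iter):
    """Consume any async token stream and report time-to-first-token
    (the moment TTS could start) alongside the full text."""
    start = time.perf_counter()
    ttft = None
    parts = []
    async for tok in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(tok)
    return ttft, "".join(parts)

async def fake_llm():
    await asyncio.sleep(0.05)        # stand-in for prefill latency
    for tok in ["Sure,", " booking", " now."]:
        yield tok
        await asyncio.sleep(0.01)
```

Logging TTFT per call is how you catch regressions from longer prompts or cold model caches before callers notice them.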

For TTS, Microsoft VibeVoice-Realtime-0.5B is an open-source MIT-licensed model achieving ~300ms latency via streaming generation — a strong option to avoid paid TTS APIs.

Turn detection and barge-in

Phone users talk over each other. You need VAD (Voice Activity Detection) + turn detection to allow interrupting the bot and cutting off TTS playback instantly. LiveKit's framework uses Silero VAD and a multilingual turn detector designed for telephony participants.
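A production stack should use a trained VAD like Silero, but the barge-in control flow itself is simple. A crude energy-threshold sketch with hypothetical threshold values, standing in for a real detector:

```python
import struct

def rms(frame: bytes) -> float:
    """RMS energy of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return (sum(s * s for s in samples) / n) ** 0.5

class BargeInGate:
    """Cut TTS playback once caller speech energy crosses a threshold
    for a few consecutive frames (debouncing out line noise)."""
    def __init__(self, threshold: float = 500.0, frames_needed: int = 3):
        self.threshold = threshold
        self.frames_needed = frames_needed
        self.run = 0
        self.tts_playing = True

    def feed(self, frame: bytes) -> bool:
        # Count consecutive voiced frames; reset on silence.
        self.run = self.run + 1 if rms(frame) > self.threshold else 0
        if self.tts_playing and self.run >= self.frames_needed:
            self.tts_playing = False   # caller barged in: stop playback
        return self.tts_playing
```

The same gate also signals the pipeline to cancel any in-flight LLM generation, so the bot does not finish a reply the caller already talked over.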

AI voice agent server processing 25 concurrent phone calls from a basement
Open Source

Repos you can fork today

These are concrete GitHub projects you can clone and run. Each one handles a different piece of the voice agent stack.

pBread/twilio-openai-minimalist-voicebot
Minimalist voice bot using Twilio Media Streams + WebSocket server + OpenAI Realtime API. Shows the TwiML <Stream/> verb, the WebSocket bridge, and bidirectional audio forwarding.
ericrisco/twilio-realtime-openai-rag
Incoming calls with WebSocket streaming between Twilio and OpenAI Realtime. Includes RAG-style backends, system prompts, and simulated business logic.
prakharbhardwaj/twilio-deepgram-voice-assistant
Twilio calls stream audio via Media Streams to a WebSocket endpoint, which pipes into Deepgram's Voice Agent API and streams responses back.
livekit/agents
Official LiveKit Agents framework. Build AI servers that see, hear, and speak in real time. Supports SIP telephony, plug-in STT/LLM/TTS combos, and worker pool distribution.
happyrobot-ai/livekit-agents
LiveKit Agents fork with Deepgram STT + GPT-4.1 mini + Cartesia TTS. Includes telephony-optimized noise cancellation and multilingual turn detection.
Microsoft VibeVoice-Realtime-0.5B
Open-source MIT-licensed TTS model achieving ~300ms latency via continuous acoustic tokenizer and streaming generation. Replaces paid TTS APIs.
Landscape

Commercial providers
we evaluated

Before building our own, we evaluated the commercial landscape for sub-200ms voice agent platforms. These are the serious players:

Retell AI — Voice-first platform for support/sales agents. Reliable low-latency inbound and outbound calls with good interruption handling.

Telnyx Voice AI — CPaaS + AI stack for real-time agents with full call control. One of the leading voice AI providers for both inbound and outbound automation.

ElevenLabs — High-quality TTS with ~150ms latency for streaming speech. Often used as the TTS component in custom pipelines.

LiveKit + DruidX — Voice agents built on LiveKit with sub-200ms response latency. DruidX emphasizes LiveKit's low-latency infrastructure.

We are building our own because we need 24/7 operation at scale without per-minute costs, full data privacy for client conversations, and the ability to customize every layer of the stack for our specific use cases — real estate lead handling, divorce intake, and appointment scheduling.

Want the results
without the build?

We are offering AI voice agent services for businesses that want sub-200ms phone agents without building the infrastructure themselves. Same hardware, same latency, managed by our team.

Get in Touch