The complete guide to building a production AI call center using a $2,300 gaming PC from Costco. Handles 25 concurrent calls at 170ms latency — 3.8x cheaper than cloud.
We started researching sub-200ms AI voice agents for inbound and outbound phone calls — the kind that can handle real estate leads, divorce intake, appointment booking, and customer support with natural conversation at telephone speed.
The first question was build vs. buy. Cloud GPU servers from providers like RunPod, Lambda, and DigitalOcean run $1,360 to $2,400 per month for a single A100/H100 equivalent, and our workload needs 5 GPUs for 25 concurrent calls. Even at the low end, that is $6,800 per month in cloud costs for the same performance a single home machine delivers.
We were originally researching used RTX 3090 builds when we found a smoking deal at Costco: the MSI Aegis Gaming Desktop with an RTX 5080 16GB, AMD Ryzen 9 9900X, 32GB DDR5, and 2TB NVMe — for $2,299.99 ($300 off, member savings valid through March 29, 2026).
The math was not close. A home server on Michigan grid electricity ($0.14/kWh) costs about $150/month to operate; equivalent cloud compute runs $1,350 to $2,550/month. Even counting the $2,300 purchase price, Year 1 at home totals roughly $4,100 against $16,000+ in the cloud, making cloud about 3.8x more expensive in Year 1 with no break-even point after that.
Cloud providers use the same Michigan grid electricity — they just mark it up 12x with data center overhead, profit, and multi-tenant management costs.
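The Year 1 comparison above can be checked directly. A quick sketch using the article's figures ($2,300 hardware, $150/month electricity, $1,350 to $2,550/month for equivalent cloud compute):

```python
# Year 1 cost comparison, using the figures from this article.
HARDWARE = 2300                       # Costco MSI Aegis, one-time purchase
HOME_POWER = 150                      # $/month at $0.14/kWh Michigan grid
CLOUD_LOW, CLOUD_HIGH = 1350, 2550    # $/month for equivalent cloud compute

home_year1 = HARDWARE + 12 * HOME_POWER
cloud_year1_low = 12 * CLOUD_LOW

print(home_year1)                     # 4100
print(cloud_year1_low)                # 16200
print(round(cloud_year1_low / home_year1, 1))   # 4.0, in line with the ~3.8x claim
```

From Year 2 onward the gap widens further: the home box costs $1,800/year while the cloud bill repeats in full.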
Sub-200ms on phone calls is achievable when you architect for streaming and parallelism across STT, LLM, and TTS, rather than waiting for full utterances at each stage. The call flows caller → SIP trunk → gateway → STT → LLM → TTS → back to the caller, with every stage streaming into the next.
To hit near 200ms perceived response time, each stage has a strict budget:
| Stage | Target | How |
|---|---|---|
| Network (caller to gateway) | < 50–70ms | SIP trunk, same-region |
| Gateway to AI server | < 50ms | Localhost (same machine) |
| STT (streaming partials) | < 100ms | Deepgram / local Whisper |
| LLM (time to first token) | 200–400ms | 3–7B model, vLLM, quantized |
| TTS (time to first audio) | 100–300ms | VibeVoice / streaming TTS |
The key insight: these stages run in parallel via streaming, not sequentially. STT sends partial transcripts to the LLM before the user finishes speaking. The LLM starts generating tokens on partial input. TTS starts synthesizing audio from the first tokens. The result is overlapping execution that makes the pipeline feel instantaneous.
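The overlapping execution described above can be sketched with plain asyncio queues. This is a simulation, not a real pipeline: the sleep times stand in for frame, token, and synthesis latencies, and the point it demonstrates is that the first TTS audio is ready long before STT finishes consuming the caller's speech.

```python
import asyncio
import time

async def stt(frames, partials):
    """Simulated streaming STT: emits one partial transcript per audio frame."""
    for i in range(frames):
        await asyncio.sleep(0.02)          # one 20 ms audio frame
        await partials.put(f"partial {i}")
    await partials.put(None)               # end of utterance
    return time.monotonic()                # when STT finished

async def llm(partials, tokens):
    """Starts generating on the FIRST partial, not the final transcript."""
    await partials.get()                   # first partial arrives ~20 ms in
    for tok in "Sure, I can help with that.".split():
        await asyncio.sleep(0.01)          # simulated per-token latency
        await tokens.put(tok)
    await tokens.put(None)
    while await partials.get() is not None:
        pass                               # real code would revise the reply

async def tts(tokens):
    """Synthesizes from the first tokens; returns the time of first audio."""
    first_audio = None
    while (tok := await tokens.get()) is not None:
        await asyncio.sleep(0.005)         # simulated synthesis per token
        if first_audio is None:
            first_audio = time.monotonic()
    return first_audio

async def main():
    partials, tokens = asyncio.Queue(), asyncio.Queue()
    stt_done, _, first_audio = await asyncio.gather(
        stt(50, partials),                 # 50 frames = 1 s of caller speech
        llm(partials, tokens),
        tts(tokens),
    )
    return first_audio < stt_done          # audio starts before caller stops

print(asyncio.run(main()))                 # True: the stages overlap
```

Swap the simulated coroutines for real STT/LLM/TTS clients and the queue topology stays the same.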
Install Asterisk on the server. It handles SIP trunks, inbound DIDs, and outbound dialing. Use ARI (Asterisk REST Interface) + ExternalMedia to create a mixing bridge that pipes RTP audio to your AI process over UDP on localhost. Your Python/Go service decodes PCM, runs it through STT, feeds the LLM, generates TTS audio, and sends PCM back to the bridge. This is the "roll your own Twilio Media Streams" approach — full control, no per-minute fees.
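At the decode step, the bridge delivers G.711 μ-law RTP over localhost UDP. A minimal sketch of that step, assuming the simple fixed 12-byte RTP header (no CSRC list or extensions) and omitting the ARI bridge setup itself:

```python
import struct

def rtp_payload(packet: bytes) -> bytes:
    """Strip a basic 12-byte RTP header (assumes no CSRC list, no extensions)."""
    return packet[12:]

def ulaw_to_pcm16(code: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~code & 0xFF                      # mu-law bytes are transmitted inverted
    sign, exponent, mantissa = u & 0x80, (u >> 4) & 0x07, u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def decode_frame(packet: bytes) -> bytes:
    """RTP packet in, little-endian 16-bit PCM out, ready to feed STT."""
    pcm = [ulaw_to_pcm16(b) for b in rtp_payload(packet)]
    return struct.pack(f"<{len(pcm)}h", *pcm)
```

The return path is the mirror image: encode the TTS output back to μ-law and send it to the ExternalMedia port the bridge gave you.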
Self-host LiveKit Server on the same machine. It handles WebRTC/SIP sessions and routes low-latency audio to agent processes. LiveKit Agents SDK registers workers that receive audio frames for each call and pipe them through your local STT/LLM/TTS stack. Supports barge-in, turn detection, and multi-participant rooms out of the box.
For real-time voice agents, time-to-first-token (TTFT) is the critical metric — it determines when the response starts streaming to TTS. Target under 300ms for the LLM stage.
3–7B open models (Llama 3.1 8B, Mistral 7B variants) are the sweet spot for self-hosting. They hit 50–400ms TTFT depending on prompt length and quantization. Run them via vLLM or TensorRT-LLM for optimized inference with streaming output.
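A typical vLLM launch for this setup might look like the following. The model name and flag values are illustrative assumptions, not a tested config: a 16GB card like the RTX 5080 needs a quantized checkpoint to fit an 8B model with headroom for KV cache.

```shell
# Serve an 8B model behind vLLM's OpenAI-compatible endpoint.
# Pick a quantized checkpoint for 16GB cards; flags below are starting points.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --port 8000
```

Keeping `--max-model-len` short bounds prompt-processing time, which is most of TTFT on long system prompts.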
For TTS, Microsoft VibeVoice-Realtime-0.5B is an open-source MIT-licensed model achieving ~300ms latency via streaming generation — a strong option to avoid paid TTS APIs.
Phone users talk over each other. You need VAD (Voice Activity Detection) + turn detection to allow interrupting the bot and cutting off TTS playback instantly. LiveKit's framework uses Silero VAD and a multilingual turn detector designed for telephony participants.
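As a stand-in for a real VAD like Silero, here is a sketch of the barge-in logic itself: gate on sustained frame energy and cut TTS playback the moment the caller starts talking. The threshold and frame count are assumptions to tune against your trunk's noise floor.

```python
import struct

SPEECH_RMS = 500     # assumed energy threshold; tune per trunk / noise floor
MIN_FRAMES = 3       # consecutive speech frames before declaring barge-in

def frame_rms(frame: bytes) -> float:
    """RMS energy of a little-endian 16-bit PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

class BargeInDetector:
    """Energy-gate stand-in for a real VAD: signal when the caller starts
    talking so the agent can stop streaming TTS audio immediately."""
    def __init__(self) -> None:
        self.run = 0
    def feed(self, frame: bytes) -> bool:
        self.run = self.run + 1 if frame_rms(frame) > SPEECH_RMS else 0
        return self.run >= MIN_FRAMES    # True -> cut TTS playback now

# Usage: silence keeps playback going; sustained speech cuts it.
det = BargeInDetector()
silence = struct.pack("<160h", *([0] * 160))      # one 20 ms frame @ 8 kHz
speech = struct.pack("<160h", *([4000] * 160))
print(det.feed(silence))                          # False
print([det.feed(speech) for _ in range(3)])       # [False, False, True]
```

A production agent would replace `frame_rms` with a model-based VAD; the cut-playback control flow stays the same.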
Everything above (Asterisk, LiveKit, vLLM, VibeVoice, Silero VAD) is an open-source project you can clone and run, and each one handles a different piece of the voice agent stack.
Before building our own, we evaluated the commercial landscape for sub-200ms voice agent platforms. These are the serious players:
Retell AI — Voice-first platform for support/sales agents. Reliable low-latency inbound and outbound calls with good interruption handling.
Telnyx Voice AI — CPaaS + AI stack for real-time agents with full call control. One of the leading voice AI providers for both inbound and outbound automation.
ElevenLabs — High-quality TTS with ~150ms latency for streaming speech. Often used as the TTS component in custom pipelines.
LiveKit + DruidX — Voice agents built on LiveKit with sub-200ms response latency. DruidX emphasizes LiveKit's low-latency infrastructure.
We are building our own because we need 24/7 operation at scale without per-minute costs, full data privacy for client conversations, and the ability to customize every layer of the stack for our specific use cases — real estate lead handling, divorce intake, and appointment scheduling.
We are offering AI voice agent services for businesses that want sub-200ms phone agents without building the infrastructure themselves. Same hardware, same latency, managed by our team.
Get in Touch