1 second voice-to-voice latency with all open models (modal.com)

🤖 AI Summary
Modal and Pipecat engineers demonstrated a full voice-to-voice conversational bot that reaches ~1 second response latency using only open-weight models and open frameworks. They built a Pipecat pipeline (WebRTC transport → STT → RAG/LLM → TTS → WebRTC) orchestrated by Pipecat's stateful processors (SmallWebRTCTransport, Silero VAD + SmartTurn for turn-taking) and deployed inference as independent, autoscaling Modal services. The result is a vendor-neutral, low-cost architecture that supports multi-turn interruption handling and a real-time UX while remaining reproducible from their GitHub repo.

Key technical choices and optimizations behind the latency:
- STT: NVIDIA parakeet-tdt-0.6b-v3 on segmented audio, which yields final transcripts faster than streaming alternatives.
- LLM: a small, fast model (Qwen3-4B-Instruct-2507) served with vLLM and CUDA-graph tuning to minimize time-to-first-token.
- TTS: KokoroTTS (82M) for streaming audio output and correct phonetics.
- RAG: ChromaDB + all-MiniLM-L6-v2 embeddings with OpenVINO for sub-100 ms retrieval.

Network tactics are emphasized as essential engineering patterns for real-time AI: Pipecat WebRTC for the client↔bot leg, Modal tunnels to reduce input-plane hops, and separating the long-running CPU bot container from the GPU-autoscaled inference services. The post is significant for the AI/ML community because it documents practical, open-source end-to-end patterns for sub-second conversational latency and the trade-offs (TTFT vs. cold starts, batching vs. streaming) required to achieve them.
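To make the pipeline shape concrete, here is a minimal sketch of that processor chain using Pipecat's Pipeline API. The processor objects (transport, stt, llm, tts, context_aggregator) are assumed to be constructed elsewhere; their exact classes and configuration in the repo may differ.

```python
# Sketch of the pipeline described in the post, in Pipecat's Pipeline API.
# Constructor details are assumptions; the repo's actual wiring may differ.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def run_bot(transport, stt, llm, tts, context_aggregator):
    # transport: SmallWebRTCTransport configured with Silero VAD + SmartTurn
    # stt/llm/tts: services that call the autoscaled Modal endpoints
    pipeline = Pipeline([
        transport.input(),               # audio in from the browser via WebRTC
        stt,                             # segmented STT (parakeet-tdt-0.6b-v3)
        context_aggregator.user(),       # append the user turn to the chat context
        llm,                             # Qwen3-4B behind vLLM, with RAG context
        tts,                             # KokoroTTS streaming audio out
        transport.output(),              # audio back to the browser via WebRTC
        context_aggregator.assistant(),  # append the bot turn to the chat context
    ])
    await PipelineRunner().run(PipelineTask(pipeline))
```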
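On the serving side, each model runs as its own Modal service so GPU capacity scales independently of the long-running CPU bot container. A rough sketch of one such service follows, assuming Modal's class-based API and vLLM's offline LLM interface; the app name, GPU type, and warm-container setting are illustrative, not the post's exact deployment.

```python
# Sketch of one autoscaled GPU inference service on Modal (illustrative names).
import modal

app = modal.App("voice-llm")
image = modal.Image.debian_slim().pip_install("vllm")

# min_containers=1 keeps a replica warm, trading idle cost for no cold-start
# hit on time-to-first-token (parameter name per recent Modal releases).
@app.cls(image=image, gpu="H100", min_containers=1)
class LLMService:
    @modal.enter()
    def load(self):
        from vllm import LLM, SamplingParams
        # vLLM captures CUDA graphs during engine init by default, paying
        # startup time once to lower per-request TTFT afterwards.
        self.llm = LLM(model="Qwen/Qwen3-4B-Instruct-2507")
        self.params = SamplingParams(max_tokens=256)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.llm.generate([prompt], self.params)[0].outputs[0].text
```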
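The retrieval step is plain ChromaDB, whose default embedding function happens to be all-MiniLM-L6-v2; the post's OpenVINO acceleration is elided in this sketch, and the documents and query are made up for illustration.

```python
# Sketch of the RAG retrieval step with ChromaDB's default
# all-MiniLM-L6-v2 embeddings (illustrative data).
import chromadb

client = chromadb.Client()  # in-memory; a deployed bot would persist this
docs = client.create_collection("kb")
docs.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Modal services autoscale GPU containers independently.",
        "Pipecat pipelines are chains of frame processors.",
    ],
)

# Retrieval must stay well under the ~100 ms budget so it can run
# before the LLM call without blowing the 1 s round trip.
hits = docs.query(query_texts=["how does autoscaling work?"], n_results=2)
print(hits["documents"][0])
```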