Open (Apache 2.0) TTS model for streaming conversational audio in realtime (github.com)

🤖 AI Summary
Nari Labs has released Dia2, an open-source (Apache 2.0) streaming dialogue TTS model designed for real-time conversational audio. Unlike offline TTS systems, Dia2 can begin synthesis after only a few input words and can be conditioned on short audio prefixes (e.g., assistant and user clips) so that responses sound natural in back-and-forth speech.

The release includes 1B and 2B parameter checkpoints, a JAX/Bonsai implementation, an inference CLI, a Gradio app on Hugging Face Spaces, a Dia2 TTS server for true streaming, and Sori, a Rust speech-to-speech engine powered by Dia2. Generation is capped at roughly two minutes of audio (max_context_steps of 1500), and the model outputs audio tokens, waveform tensors, and word timestamps at Mimi's ~12.5 Hz frame rate.

Technically, Dia2 auto-selects CUDA when available (requiring CUDA 12.8+ drivers), defaults to bfloat16 precision, and supports performance features such as CUDA graphs. The CLI and Python API expose GenerationConfig/SamplingConfig parameters (cfg_scale, temperature, top_k) for tuning, and conditional generation relies on prefix audio transcribed with Whisper for context.

The release accelerates research into low-latency voice agents, speech-to-speech systems, and interactive assistants by providing ready-to-use code and checkpoints, though it comes with strict ethical rules banning identity mimicry, deceptive content, and malicious use.
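The two-minute cap and the 1500-step limit are consistent with each other: at Mimi's ~12.5 Hz frame rate, each generation step corresponds to 80 ms of audio. A quick arithmetic sketch (constants taken from the summary above):

```python
# Sanity-check the ~2 minute generation cap implied by the summary:
# max_context_steps (1500) at Mimi's ~12.5 Hz frame rate.
MIMI_FRAME_RATE_HZ = 12.5   # audio frames per second (from the summary)
MAX_CONTEXT_STEPS = 1500    # model's generation step limit (from the summary)

seconds_per_step = 1.0 / MIMI_FRAME_RATE_HZ          # 0.08 s (80 ms) per step
max_seconds = MAX_CONTEXT_STEPS / MIMI_FRAME_RATE_HZ  # total audio duration

print(seconds_per_step)  # 0.08
print(max_seconds)       # 120.0 -> exactly 2 minutes
```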
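To illustrate how the exposed sampling knobs compose, here is a minimal sketch of a GenerationConfig/SamplingConfig pairing. Only the parameter names (cfg_scale, temperature, top_k, max_context_steps) come from the summary; the dataclass structure, field defaults, and everything else here are assumptions for illustration, not Dia2's confirmed API.

```python
# Hypothetical sketch -- structure and defaults are assumptions; only the
# parameter names (cfg_scale, temperature, top_k) appear in the release notes.
from dataclasses import dataclass, field

@dataclass
class SamplingConfig:
    cfg_scale: float = 2.0    # classifier-free guidance strength (assumed default)
    temperature: float = 0.8  # softmax temperature for audio-token sampling
    top_k: int = 50           # sample only from the k most likely tokens

@dataclass
class GenerationConfig:
    sampling: SamplingConfig = field(default_factory=SamplingConfig)
    max_context_steps: int = 1500  # ~2 min of audio at Mimi's ~12.5 Hz

# Tighten sampling for a more deterministic, conservative voice:
cfg = GenerationConfig(sampling=SamplingConfig(temperature=0.6, top_k=30))
print(cfg.sampling.top_k)        # 30
print(cfg.max_context_steps)     # 1500
```

Lower temperature and top_k trade expressiveness for stability, while cfg_scale controls how strongly generation follows the conditioning text and prefix audio.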