Open source speech foundation model that runs locally on CPU in real-time (huggingface.co)

🤖 AI Summary
Neuphonic announced NeuTTS Air, an open‑source, on‑device TTS foundation model that delivers highly realistic speech and instant voice cloning in real time on CPU. Built around a lightweight 0.5B‑parameter LLM backbone (Qwen 0.5B) and a proprietary neural codec (NeuCodec), NeuTTS Air ships in GGML/GGUF formats, so it can synthesize natural, human‑quality voices on phones, laptops, and even Raspberry Pis without cloud APIs. The system can clone a speaker from as little as 3 seconds of reference audio: it accepts a reference .wav plus text and outputs 24 kHz audio. Example code and a GitHub repo make it straightforward to try locally (Python >= 3.11, espeak dependency; PyTorch is optional when using GGML/ONNX).

Technically, the project targets the "sweet spot" between latency, model size, and fidelity: a 0.5B backbone for fast text understanding and generation, a single‑codebook NeuCodec for low‑bitrate, high‑quality audio, and GGML inference for real‑time, low‑power performance on CPU. Outputs carry a Perth watermark to aid provenance and compliance.

Significance: NeuTTS Air democratizes high‑quality voice AI for embedded agents and privacy‑sensitive apps by removing the cloud dependence, but its instant cloning capability also raises misuse risks. These are mitigated somewhat by watermarking and local control, making responsible deployment and governance critical.
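As a concrete sketch of the I/O contract described above (reference .wav of at least ~3 seconds in, 24 kHz audio out), the stdlib-only snippet below validates a reference clip's duration and writes a 24 kHz mono WAV. It does not invoke the model itself; the function names and the 16-bit PCM choice are illustrative assumptions, not part of the NeuTTS Air API.

```python
# Illustrative only: models the input/output constraints from the summary,
# not the actual NeuTTS Air library API.
import math
import struct
import wave

SAMPLE_RATE = 24_000       # NeuTTS Air outputs 24 kHz audio
MIN_REF_SECONDS = 3.0      # voice cloning needs ~3 s of reference audio

def reference_is_long_enough(path: str) -> bool:
    """Check that a reference .wav meets the minimum duration for cloning."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return duration >= MIN_REF_SECONDS

def write_24khz_wav(path: str, samples: list[float]) -> None:
    """Write float samples in [-1, 1] as 16-bit mono PCM at 24 kHz."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 16-bit
        wf.setframerate(SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# Example: a 3-second 440 Hz sine as a stand-in for synthesized speech.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(int(3.0 * SAMPLE_RATE))]
write_24khz_wav("ref.wav", tone)
print(reference_is_long_enough("ref.wav"))  # True: exactly 3 s at 24 kHz
```

In a real workflow the reference clip and the synthesized output would both pass through the model; this sketch only checks the formats the summary specifies.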