🤖 AI Summary
Kani TTS is an open-source, real-time Text-to-Speech system that packs high fidelity into a compact 370M-parameter model. It uses a two-stage pipeline: a backbone LLM generates compressed audio tokens, and a neural audio codec decodes those tokens into waveforms. This design yields very low latency, roughly 1 second to produce 15 seconds of 22 kHz audio on an NVIDIA RTX 5080, while using only ~2 GB of VRAM. Licensed under Apache 2.0 and pretrained on ~80k hours of speech from sources such as LibriTTS and Common Voice, Kani supports English, German, Chinese, Korean, Arabic, and Spanish, scores MOS ~4.3/5 for naturalness with WER <5% for accuracy, and ships with several ready-made voices for conversational use.
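To make the two-stage flow concrete, here is a minimal, self-contained sketch of the token-then-decode pattern. All class names, shapes, and vocabulary sizes below (`TinyBackboneLM`, `TinyCodecDecoder`, etc.) are hypothetical stand-ins for illustration, not the actual Kani TTS API.

```python
# Minimal sketch of a two-stage token-then-decode TTS pipeline.
# Every name and size here is an illustrative stand-in, NOT Kani's real API.
import torch
import torch.nn as nn

class TinyBackboneLM(nn.Module):
    """Stand-in for the LLM backbone that autoregressively emits
    compressed audio tokens conditioned on the input text tokens."""
    def __init__(self, vocab_size=4096, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, text_ids, max_new_tokens=64):
        seq = text_ids
        for _ in range(max_new_tokens):
            h = self.layer(self.embed(seq))
            next_tok = self.head(h[:, -1]).argmax(dim=-1, keepdim=True)
            seq = torch.cat([seq, next_tok], dim=1)
        return seq[:, text_ids.shape[1]:]  # keep only the new audio tokens

class TinyCodecDecoder(nn.Module):
    """Stand-in for the neural audio codec that maps discrete tokens
    back to waveform samples (here: 1 token -> 512 samples)."""
    def __init__(self, vocab_size=4096, samples_per_token=512):
        super().__init__()
        self.table = nn.Embedding(vocab_size, samples_per_token)

    @torch.no_grad()
    def decode(self, audio_tokens):
        frames = self.table(audio_tokens)   # (B, T, samples_per_token)
        return frames.flatten(start_dim=1)  # (B, T * samples_per_token)

text_ids = torch.randint(0, 4096, (1, 12))          # pretend-tokenized text
audio_tokens = TinyBackboneLM().generate(text_ids)  # stage 1: text -> codec tokens
waveform = TinyCodecDecoder().decode(audio_tokens)  # stage 2: tokens -> samples
print(waveform.shape)                               # e.g. torch.Size([1, 32768])
```

The design point this illustrates is that the LLM never touches raw samples: it operates over a short, heavily compressed token sequence, and only the lightweight codec decoder expands tokens into audio, which is what keeps end-to-end generation fast enough for real time.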
Technically, this design emphasizes edge- and server-friendly inference (optimized for NVIDIA Blackwell GPUs) and fast fine-tuning workflows (the base model was trained on 8× H100s in ~45 hours). The key implication is that low-cost, real-time speech becomes practical to deploy for chatbots, accessibility tools, and research into voice adaptation, while the permissive license allows community-driven improvements. Limitations include degraded performance on inputs longer than 2000 tokens, reduced expressivity without targeted fine-tuning, and potential prosody/pronunciation biases inherited from the training data; further language-specific pretraining and NanoCodec fine-tuning are recommended to address these.
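Since quality reportedly degrades past roughly 2000 input tokens, long scripts are best synthesized in pieces. Below is a hedged workaround sketch that splits text at sentence boundaries under a conservative token budget; the token count is a crude whitespace heuristic rather than the model's real tokenizer, and `synthesize` is a hypothetical TTS call, not part of any documented Kani interface.

```python
# Workaround sketch for the >2000-token degradation: split long text at
# sentence boundaries so each synthesis call stays under a safe budget.
import re

MAX_TOKENS = 1500  # conservative margin under the ~2000-token limit

def rough_token_count(text: str) -> int:
    # Crude proxy: whitespace-split words; a real tokenizer will differ.
    return len(text.split())

def chunk_text(text: str, budget: int = MAX_TOKENS) -> list[str]:
    # Split on sentence-ending punctuation, then greedily pack sentences
    # into chunks that stay under the token budget.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if rough_token_count(candidate) > budget and current:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Usage: synthesize each chunk separately and concatenate the audio.
# for chunk in chunk_text(long_script):
#     audio_segments.append(synthesize(chunk))  # hypothetical TTS call
```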