🤖 AI Summary
Kani TTS is an open-source, real-time Text-to-Speech system that packs high fidelity into a compact 370M-parameter model. It uses a two-stage pipeline: a backbone LLM generates compressed audio tokens, and a neural audio codec decodes those tokens into waveforms. This design yields very low latency, roughly 1 second to produce 15 seconds of 22 kHz audio on an NVIDIA RTX 5080, while using only ~2 GB of VRAM. Licensed under Apache 2.0 and pretrained on ~80k hours of speech from sources such as LibriTTS and Common Voice, Kani supports English, German, Chinese, Korean, Arabic, and Spanish, scores MOS ~4.3/5 for naturalness with WER <5% for accuracy, and ships with several ready-made voices for conversational use.
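To make the two-stage flow concrete, here is a minimal, self-contained sketch of the token-then-decode pattern. All class names, shapes, and vocabulary sizes below (`TinyBackboneLM`, `TinyCodecDecoder`, etc.) are hypothetical stand-ins for illustration, not the actual Kani TTS API.

```python
# Minimal sketch of a two-stage token-then-decode TTS pipeline.
# Every name and size here is an illustrative stand-in, NOT Kani's real API.
import torch
import torch.nn as nn

class TinyBackboneLM(nn.Module):
    """Stand-in for the LLM backbone that autoregressively emits
    compressed audio tokens conditioned on the input text tokens."""
    def __init__(self, vocab_size=4096, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, text_ids, max_new_tokens=64):
        seq = text_ids
        for _ in range(max_new_tokens):
            h = self.layer(self.embed(seq))
            next_tok = self.head(h[:, -1]).argmax(dim=-1, keepdim=True)
            seq = torch.cat([seq, next_tok], dim=1)
        return seq[:, text_ids.shape[1]:]  # keep only the new audio tokens

class TinyCodecDecoder(nn.Module):
    """Stand-in for the neural audio codec that maps discrete tokens
    back to waveform samples (here: 1 token -> 512 samples)."""
    def __init__(self, vocab_size=4096, samples_per_token=512):
        super().__init__()
        self.table = nn.Embedding(vocab_size, samples_per_token)

    @torch.no_grad()
    def decode(self, audio_tokens):
        frames = self.table(audio_tokens)   # (B, T, samples_per_token)
        return frames.flatten(start_dim=1)  # (B, T * samples_per_token)

text_ids = torch.randint(0, 4096, (1, 12))          # pretend-tokenized text
audio_tokens = TinyBackboneLM().generate(text_ids)  # stage 1: text -> codec tokens
waveform = TinyCodecDecoder().decode(audio_tokens)  # stage 2: tokens -> samples
print(waveform.shape)                               # e.g. torch.Size([1, 32768])
```

The design point this illustrates is that the LLM never touches raw samples: it operates over a short, heavily compressed token sequence, and only the lightweight codec decoder expands tokens into audio, which is what keeps end-to-end generation fast enough for real time.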
Technically, this design emphasizes edge- and server-friendly inference (optimized for NVIDIA Blackwell GPUs) and fast fine-tuning workflows (the base model was trained on 8× H100s in ~45 hours). The key implication is that low-cost, real-time speech becomes practical to deploy for chatbots, accessibility tools, and research into voice adaptation, while the permissive license allows community-driven improvements. Limitations include degraded performance on inputs longer than 2000 tokens, reduced expressivity without targeted fine-tuning, and potential prosody/pronunciation biases inherited from the training data; further language-specific pretraining and NanoCodec fine-tuning are recommended to address these.
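Since quality reportedly degrades past roughly 2000 input tokens, long scripts are best synthesized in pieces. Below is a hedged workaround sketch that splits text at sentence boundaries under a conservative token budget; the token count is a crude whitespace heuristic rather than the model's real tokenizer, and `synthesize` is a hypothetical TTS call, not part of any documented Kani interface.

```python
# Workaround sketch for the >2000-token degradation: split long text at
# sentence boundaries so each synthesis call stays under a safe budget.
import re

MAX_TOKENS = 1500  # conservative margin under the ~2000-token limit

def rough_token_count(text: str) -> int:
    # Crude proxy: whitespace-split words; a real tokenizer will differ.
    return len(text.split())

def chunk_text(text: str, budget: int = MAX_TOKENS) -> list[str]:
    # Split on sentence-ending punctuation, then greedily pack sentences
    # into chunks that stay under the token budget.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if rough_token_count(candidate) > budget and current:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Usage: synthesize each chunk separately and concatenate the audio.
# for chunk in chunk_text(long_script):
#     audio_segments.append(synthesize(chunk))  # hypothetical TTS call
```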