Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU (github.com)

🤖 AI Summary
Sopro is a lightweight text-to-speech (TTS) model with 169 million parameters that supports real-time streaming and zero-shot voice cloning. Built as a personal project, it departs from the prevalent Transformer designs by combining dilated convolutions with lightweight cross-attention layers. It accepts some compromises in voice quality and variability in exchange for efficiency: on a CPU it reaches a real-time factor of 0.25, so 30 seconds of audio takes about 7.5 seconds to generate. This is notable given that the model was trained on a limited budget with a single GPU. Zero-shot voice cloning lets Sopro synthesize speech in a target voice from just 3-12 seconds of reference audio, although cloning quality varies with microphone quality and environmental noise, which complicates practical use. For the broader AI/ML community, Sopro's architectural choices and its performance on limited hardware may encourage further TTS work, especially among researchers with constrained resources. The project also points to possible improvements, such as support for multiple languages and better results from improved training datasets.
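The real-time factor figure can be checked with a little arithmetic. The sketch below assumes the common definition of real-time factor (time spent generating divided by the duration of the audio produced); the function names are illustrative and are not part of Sopro's code.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def generation_time(rtf: float, audio_seconds: float) -> float:
    """Return the time needed to synthesize audio_seconds of speech at a given RTF."""
    return rtf * audio_seconds


# Figures quoted in the summary: 30 s of audio generated in 7.5 s on a CPU.
rtf = real_time_factor(7.5, 30.0)
print(rtf)                         # 0.25
print(generation_time(rtf, 30.0))  # 7.5
print(rtf < 1.0)                   # True: faster than real time
```

Under this definition, any value below 1.0 means audio is produced faster than it plays back, which is what makes streaming output feasible.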