🤖 AI Summary
A new open-source text-to-speech (TTS) model, VITS EVOlution, has been announced, offering real-time synthesis, zero-shot voice cloning, and voice blending. The stack pairs an ONNX speaker encoder with ONNX TTS inference, uses DeepPhonemizer for phoneme conversion, and ships under permissive licenses for easy integration. Notably, it reports inference roughly 5.6 times faster than real time on an Intel Xeon Platinum CPU, making it suitable for applications that need low-latency audio output.
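The "5.6 times faster than real time" figure is a real-time factor: seconds of audio produced per second of compute. A minimal sketch of that arithmetic (the timings below are illustrative, not measurements from the model):

```python
# Real-time factor (RTF): how much audio is generated per second of compute.
# Values > 1 mean synthesis runs faster than playback.
def realtime_factor(audio_seconds: float, inference_seconds: float) -> float:
    """Return how many times faster than real time synthesis runs."""
    return audio_seconds / inference_seconds

# Hypothetical example: 10 s of speech synthesized in ~1.79 s of CPU time
print(round(realtime_factor(10.0, 1.79), 1))  # → 5.6
```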
The VITS EVOlution model lets users clone a voice from a single reference clip and blend multiple speaker embeddings to generate novel voices. Setup takes only a few steps: download the model and supporting components, then launch the bundled Gradio demo for web-based interaction and voice experimentation. This advancement not only improves accessibility and personalization in TTS applications but also underscores the trend toward real-time, high-quality voice synthesis in the AI/ML community, enabling new use cases in sectors such as gaming, virtual assistants, and content creation.
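Blending speaker embeddings is commonly done as a weighted combination of the encoder's output vectors. A minimal sketch under that assumption (the function name, 256-dim embeddings, and weights are illustrative, not taken from the project):

```python
import numpy as np

def blend_embeddings(embeddings, weights):
    """Weighted average of speaker embeddings, renormalized to unit length."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / w.sum()                       # normalize weights to sum to 1
    mixed = np.average(np.stack(embeddings), axis=0, weights=w)
    return mixed / np.linalg.norm(mixed)  # unit norm, as speaker encoders typically emit

# Two hypothetical 256-dim speaker embeddings blended 70/30
a = np.random.default_rng(0).normal(size=256).astype(np.float32)
b = np.random.default_rng(1).normal(size=256).astype(np.float32)
mix = blend_embeddings([a, b], [0.7, 0.3])
print(mix.shape)  # → (256,)
```

The blended vector is then fed to the TTS decoder in place of a single speaker's embedding, producing a voice "between" the sources.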