New SoTA open source TTS model from Boson AI (huggingface.co)

🤖 AI Summary
Boson AI has released Higgs Audio v3, a text-to-speech (TTS) model aimed at voice chat. It generates expressive conversational speech in more than 100 languages and supports zero-shot voice cloning and real-time control over emotion, style, prosody, and sound effects. Where traditional TTS systems simply read text aloud, Higgs Audio v3 is designed to make conversations sound more engaging and realistic. The model is available for research and non-commercial use under a specific license; commercial use requires separate licensing, and unethical uses such as impersonation or biometric surveillance are strictly prohibited.

Technically, Higgs Audio v3 is an autoregressive model that takes interleaved text and audio tokens as input and processes them with a multi-codebook approach. It has a context length of 8,192 tokens and reports low word error rates (WER) across multiple languages. Control tokens for emotion and delivery style are embedded directly in the input. Its results on multilingual benchmarks and its vocal expressiveness suggest it could change how users interact with AI-driven applications.
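For readers unfamiliar with the WER metric mentioned in the summary, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and a recognized transcript of the synthesized audio, divided by the number of reference words. This is an illustration of the general metric only. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions for this example, not the evaluation pipeline used by Boson AI, which the summary does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; real evaluations usually apply heavier normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

For TTS, the hypothesis is typically produced by running a speech recognizer over the generated audio, so the score reflects both synthesis intelligibility and recognizer errors.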