Cloning a voice at 48 kHz with VoxCPM2 in ElevenLabs API quality (soniqo.audio)

0 points 14 hours ago | visit original

🤖 AI Summary

Soniqo has unveiled its latest text-to-speech (TTS) model, VoxCPM2, which clones voices and produces 48 kHz audio directly on the user's device. Running locally improves privacy, allows offline use, eliminates per-call costs, and leaves the user in full ownership of the cloned voice. Users can create bespoke audio content, such as audiobooks narrated in a loved one's voice, and creators like YouTubers and podcasters can keep their voice consistent across languages. People facing voice loss can record a short clip and then generate realistic speech through assistive technology. Architecturally, VoxCPM2 combines transformers with a novel diffusion process to produce high-fidelity audio: a small 28-layer language model determines the audio output, while a diffusion head adds acoustic detail. This design lets VoxCPM2 reach high audio quality without a large model, which sets it apart from cloud-based options like ElevenLabs that require internet access and carry per-call costs. Three model sizes are available, so developers can pick the one that fits their needs, making it a versatile option for anyone working on speech synthesis.
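For a sense of what 48 kHz output means in practice, here is a minimal sketch of the arithmetic. It assumes uncompressed 16-bit mono PCM, which the summary does not state, and `pcm_stats` is a hypothetical helper name, not part of VoxCPM2 or any Soniqo API.

```python
def pcm_stats(seconds: float, sample_rate: int = 48_000,
              bytes_per_sample: int = 2, channels: int = 1) -> dict:
    """Size of an uncompressed PCM clip (assumed format: 16-bit mono)."""
    samples = int(round(seconds * sample_rate))  # samples per channel
    return {
        "samples": samples,
        "bytes": samples * bytes_per_sample * channels,
        # Highest frequency the sample rate can represent (Nyquist limit).
        "nyquist_hz": sample_rate / 2,
    }


clip = pcm_stats(10.0)
print(clip)  # a 10-second clip: 480000 samples, 960000 bytes, 24000.0 Hz
```

At 48 kHz the Nyquist limit is 24 kHz, above the roughly 20 kHz upper bound usually quoted for human hearing, so the sample rate itself does not cut off audible content.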
