🤖 AI Summary
Local, high-quality text-to-speech and voice cloning are becoming practical on everyday laptops thanks to the mlx-audio library and lightweight models like Prince Canuma's Marvis-TTS. These tools generate natural-sounding speech and clone voices from short reference clips (often ~10 seconds) without sending audio to the cloud, enabling low-latency, privacy-preserving demos and workflows. The result: developers, researchers, and product teams can prototype voice interfaces, personalized voice-overs, and on-device TTS with minimal infrastructure overhead.
Getting started is straightforward: create a Hugging Face token, use the minimal uv project setup, and call mlx_audio.tts.generate with models such as Marvis-AI/marvis-tts-250m, prince-canuma/Kokoro-82M, or mlx-community/csm-1b (the latter supports --ref_audio for cloning). Typical sampling settings are temperature ~0.4, top_p 0.9, and top_k 50; you can also stream output (--stream) or run a local web server (mlx_audio.server). Quality can vary, with artifacts, startup "slams", or a mismatch to the reference voice; fixes include cleaning the reference audio with ffmpeg, keeping clips near 10 seconds, rerunning stochastic generations, and tweaking sampling parameters or trying alternate models. For the AI/ML community this means more reproducible, private voice research and faster iteration on multimodal products without relying on cloud black boxes.
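The workflow above can be sketched as a few shell commands. This is a minimal sketch, assuming mlx-audio is installed in the current environment and that the sampling flags are spelled --temperature, --top_p, and --top_k (the summary gives the values but not the flag names); the 24 kHz mono target format for the reference clip is also an assumption.

```shell
# Basic generation with a small Marvis model (model IDs from the summary above).
python -m mlx_audio.tts.generate \
  --model Marvis-AI/marvis-tts-250m \
  --text "Hello from on-device TTS." \
  --temperature 0.4 --top_p 0.9 --top_k 50

# Voice cloning with csm-1b: pass a short (~10 s) reference clip via --ref_audio.
python -m mlx_audio.tts.generate \
  --model mlx-community/csm-1b \
  --ref_audio ref.wav \
  --text "This should sound like the reference speaker."

# Clean a noisy reference clip first: mono, 24 kHz (assumed), trimmed to ~10 s.
ffmpeg -i ref_raw.m4a -ac 1 -ar 24000 -t 10 ref.wav
```

If output quality is off, rerunning the stochastic generation or nudging the sampling values is often enough; the mlx_audio.server mentioned above serves the same models behind a local web interface for faster iteration.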