TTS Still Sucks (duarteocarmo.com)

🤖 AI Summary
A creator trying to convert their blog into a podcast tested the current crop of open-source TTS and voice-cloning models and concluded that open TTS still disappoints. They evaluated top leaderboard entrants: Kokoro (notable for having only 82M parameters and a 360 MB footprint, but lacking voice cloning), Fish Audio’s S1-mini (where emotion markers, long pauses, and chunking support are often missing or gated to a closed version), and Chatterbox (better than F5-TTS but hamstrung by short output limits). Common failure modes were hallucinations, squeaks, and runaway speaking speed once clips exceeded roughly 1,000 to 2,000 characters. Many control features (emotion tags, <pause> tokens, chunking parameters) were unreliable or ignored outright.

Their production pipeline is: RSS → LLM for transcript cleanup, summary, and show-note links → chunking → parallel Modal containers running Chatterbox TTS → WAV stitching → S3 hosting (now also distributed to Spotify). One practical workaround is to split the text into one sentence per line, which stabilizes generation. The takeaway for the AI/ML community: small, efficient models such as Kokoro are a real win, but open-source voice cloning and long-form TTS remain less robust and offer less fine-grained control than gated commercial offerings. This highlights an urgent need for better duration handling, consistent control-token implementations, and truly open cloning-capable models. The author’s pipeline and code are available on GitHub for reuse and further experimentation.
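The chunking step and the one-sentence-per-line workaround can be sketched roughly as follows. This is an illustration, not the author's code: the function names `split_sentences` and `chunk_sentences` are hypothetical, the regex splitter is deliberately naive, and the 1,000-character default is an assumption taken from the lower end of the failure range reported above.

```python
import re

# Assumed limit: the summary reports failures past roughly 1,000-2,000 characters.
MAX_CHARS = 1000


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: list[str], max_chars: int = MAX_CHARS) -> list[str]:
    """Pack sentences, one per line, into chunks of at most max_chars characters.

    A single sentence longer than max_chars is not split further; it simply
    becomes its own (oversized) chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        # Account for the newline that joins this sentence to the previous one.
        added = len(sentence) + (1 if current else 0)
        if current and length + added > max_chars:
            chunks.append("\n".join(current))
            current, length = [], 0
            added = len(sentence)
        current.append(sentence)
        length += added
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    post = "TTS is hard. Long clips drift! Does chunking help? It seems to."
    for i, chunk in enumerate(chunk_sentences(split_sentences(post), max_chars=30)):
        print(f"--- chunk {i} ---\n{chunk}")
```

Each resulting chunk could then be sent to a separate TTS worker and the audio stitched back together in order. A real pipeline would likely want a sturdier sentence splitter, since this one will break on abbreviations such as "e.g." followed by a space.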