StutterZero: Speech Conversion for Stuttering Transcription and Correction (arxiv.org)

🤖 AI Summary
Researchers introduced StutterZero and StutterFormer, the first end-to-end waveform-to-waveform systems that convert stuttered speech into fluent speech while simultaneously producing a transcription. StutterZero uses a convolutional and bidirectional LSTM encoder-decoder with attention; StutterFormer uses a dual-stream Transformer that learns shared acoustic–linguistic representations. Both were trained on paired stuttered-to-fluent data synthesized from SEP-28K and LibriStutter and tested on unseen speakers from FluencyBank. Against a strong Whisper-Medium baseline, StutterZero reduced Word Error Rate (WER) by 24% and improved semantic similarity (BERTScore) by 31%; StutterFormer performed better, cutting WER by 28% and raising BERTScore by 34%. The work is significant because it avoids conventional multi-stage pipelines (ASR, then text editing, then TTS) and handcrafted features, which reduces reconstruction artifacts and allows audio and text to be corrected together. The results are evidence that direct waveform-to-waveform conversion with joint transcription is feasible, with potential uses in accessibility, real-time assistive interfaces, and speech-therapy tools. Limitations remain: because the paired training data is synthesized, performance on naturally occurring stutters and diverse accents is an open question, and practical deployment must address latency, user consent, voice authenticity, and the ethical risks of altering a speaker's vocal identity.
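For readers unfamiliar with the WER figures quoted above, here is a minimal sketch of how word error rate is commonly computed (word-level edit distance divided by the number of reference words) and how a reduction against a baseline can be expressed. It is not the paper's evaluation code. The function names and example sentences are hypothetical, and the summary does not say whether the reported reductions are relative or absolute; the sketch assumes relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fractional improvement over the baseline (assumed interpretation)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - system) / baseline


# Hypothetical example: a transcript that keeps the repeated words of a
# stuttered utterance is scored against the intended fluent sentence.
intended = "i want to go home"
with_repetitions = "i i want to to go home"
corrected = "i want to go home"

wer_before = word_error_rate(intended, with_repetitions)  # 2 insertions / 5 words
wer_after = word_error_rate(intended, corrected)
print(wer_before, wer_after)
```

Under this reading, a baseline WER of 0.50 falling to 0.38 would be a 24% relative reduction; those two numbers are illustrative only and do not come from the paper.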