🤖 AI Summary
Researchers introduced Ming-UniAudio, a new family of speech foundation models that unifies speech understanding, generation, and editing with a continuous audio tokenizer. Prior speech LLMs either used separate representations for comprehension and synthesis (which rules out editing) or discrete (quantized) tokens that lose acoustic detail. MingTok-Audio, the continuous tokenizer, and Ming-UniAudio, the first speech LLM built on it, bridge that gap: they maintain high-fidelity continuous audio representations that serve both semantic understanding and natural waveform generation. On top of that, Ming-UniAudio-Edit is presented as the first model to support free-form, instruction-driven speech editing, covering both semantic (what is said) and acoustic (how it sounds) changes, without being restricted to fixed edit regions in time.
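To make the continuous-versus-discrete distinction concrete, here is a minimal toy sketch (plain NumPy, not MingTok-Audio's actual architecture; the codebook size, latent dimension, and frame count are arbitrary assumptions) showing how snapping encoder frames to a finite codebook discards a residual of acoustic detail that a continuous representation would keep:

```python
# Illustrative sketch only: a toy contrast between discrete (quantized) and
# continuous audio tokens. Codebook size, latent dimension, and frame count
# are invented for this example, not MingTok-Audio's real configuration.
import numpy as np

rng = np.random.default_rng(0)

frames, dim, codebook_size = 200, 16, 64
latents = rng.normal(size=(frames, dim))          # "continuous tokens": per-frame encoder outputs
codebook = rng.normal(size=(codebook_size, dim))  # a discrete tokenizer's codebook

# Discrete tokenization: snap each frame to its nearest codebook vector.
dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
codes = dists.argmin(axis=1)                      # integer token IDs
quantized = codebook[codes]                       # what a discrete-token LLM actually sees

# The residual is the acoustic detail the quantized representation discards;
# a continuous tokenizer passes the full latent through, so its residual is zero.
residual = np.linalg.norm(latents - quantized, axis=-1).mean()
print(f"mean per-frame detail lost to quantization: {residual:.3f}")
print("detail lost with continuous tokens: 0.000 (identity mapping)")
```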
Technically, the move to continuous tokens avoids quantization artifacts and lets a single model perform multi-round, fine-grained edits across time, akin to image-editing workflows. The authors also release the Ming-Freeform-Audio-Edit-Benchmark to standardize evaluation of such edits. For the AI/ML community this advances interactive audio workflows (think conversational repair, voice-cloning correction, or targeted prosody changes) while inviting follow-up work on robustness, controllability, and safety (for example, mitigating misuse such as voice spoofing) as these capabilities become more accessible.
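As a rough illustration of what instruction-only, multi-round editing could look like from a caller's perspective, here is a hypothetical sketch; `EditRequest` and the printed edit calls are invented for this example and do not reflect the released Ming-UniAudio-Edit API:

```python
# Hypothetical interface sketch of free-form, instruction-driven editing.
# All names here are invented for illustration; consult the Ming-UniAudio
# release for the actual API.
from dataclasses import dataclass

@dataclass
class EditRequest:
    audio_path: str      # source waveform
    instruction: str     # natural-language edit, semantic or acoustic

requests = [
    EditRequest("meeting.wav", "Replace 'next Tuesday' with 'next Thursday'"),  # semantic edit
    EditRequest("meeting.wav", "Make the second sentence sound more excited"),  # acoustic/prosody edit
]

for req in requests:
    # A continuous-token model can chain such edits over multiple rounds without
    # the caller specifying timestamps; the instruction alone localizes the change.
    print(f"edit({req.audio_path!r}, {req.instruction!r})")
```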