Real-time speech-to-speech translation (research.google)

🤖 AI Summary
Google DeepMind has announced an end-to-end speech-to-speech translation (S2ST) model that translates in real time with a delay of about 2 seconds, compared with the 4-5 seconds typical of existing systems. The model generates the translated audio in the original speaker's voice, which makes conversation across language barriers feel more natural. Earlier approaches usually chained separate processing stages in a cascade, so errors accumulated from stage to stage and the output voice was not personalized. The new model relies on a scalable data acquisition pipeline that time-synchronizes source audio with its translation, and on a streaming architecture built on the AudioLM framework to handle continuous audio. Key technical features are RVQ (residual vector quantization) audio tokens, which represent the audio in real time, and an adjustable prediction delay that can be tuned to the pace of a conversation. The technology launched first in Google Meet and on new Pixel 10 devices, where it offers robust performance for several Latin-based languages. Expansion to additional languages is planned, with the aim of more fluid, context-aware translation.
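
To make the RVQ audio tokens mentioned above more concrete, here is a minimal sketch of residual vector quantization in plain Python. It is an illustration of the general technique only, not Google's implementation: the two tiny codebooks, the 2-dimensional "frame", and the function names (`rvq_encode`, `rvq_decode`, `nearest`) are all hypothetical, whereas a real audio codec learns its codebooks and operates on much larger feature vectors.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)),
    )


def rvq_encode(frame: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize a frame in stages: each stage encodes what the previous one missed."""
    residual = list(frame)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword; the next stage quantizes the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct the frame by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks (assumption: real codebooks are learned, not hand-written).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],      # stage 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],  # stage 2: fine
]

frame = [1.2, 0.9]                      # stand-in for one audio feature frame
tokens = rvq_encode(frame, CODEBOOKS)   # one token per stage: [1, 1]
approx = rvq_decode(tokens, CODEBOOKS)  # reconstruction: [1.25, 1.0]
print(tokens, approx)
```

Each frame becomes a short list of integer tokens, one per stage, and adding stages refines the reconstruction. That discrete, frame-by-frame form is what lets a streaming model predict audio token by token.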