Speech-to-Retrieval (S2R): A new approach to voice search (research.google)

🤖 AI Summary
Google Research today announced Speech-to-Retrieval (S2R), a production voice-search architecture that skips textual transcription and retrieves answers directly from spoken queries. S2R addresses a key weakness of cascade systems, where small ASR errors can change intent (e.g., “The Scream” becoming “screen”), by asking “what information is being sought?” rather than “what words were said?” The system is already live in multiple languages and substantially improves search accuracy over conventional ASR-then-retrieve pipelines.

Technically, S2R uses a dual-encoder design: an audio encoder converts raw speech into a dense query vector, and a document encoder embeds content into the same vector space. Trained on paired audio queries and relevance labels, the model retrieves semantically close documents via nearest-neighbor search and hands the candidates to the existing ranking stack for final scoring (a toy sketch of this step follows below).

Evaluation on the newly open-sourced Simple Voice Questions (SVQ) dataset (17 languages, 26 locales) shows S2R significantly outperforming cascade ASR and approaching a “perfect ASR” upper bound as measured by MRR, revealing that WER alone poorly predicts downstream retrieval quality. Google is releasing SVQ as part of the Massive Sound Embedding Benchmark to spur research; while S2R narrows the gap, the authors note remaining headroom and invite community innovation.
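As a rough illustration of the dual-encoder retrieval step and the MRR metric described above, the sketch below builds a toy document index, ranks it by dot-product similarity for each query vector, and scores the rankings with mean reciprocal rank. Everything here (the dimensions, the synthetic embeddings, and names like `retrieve`) is an illustrative assumption, not Google's implementation.

```python
import numpy as np

# Toy sketch of dual-encoder retrieval (all names and dimensions are
# illustrative assumptions, not the production S2R system).
rng = np.random.default_rng(42)
dim, n_docs = 64, 1000

# Pretend these came from a trained document encoder: one unit-norm
# embedding per document, stacked into an index matrix.
doc_embeddings = rng.standard_normal((n_docs, dim))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, k: int = 10) -> np.ndarray:
    """Nearest-neighbor search: rank documents by cosine similarity
    (a plain dot product, since all vectors are unit-normalized)."""
    scores = doc_embeddings @ query_vec
    return np.argsort(-scores)[:k]

def mean_reciprocal_rank(ranked_lists, relevant_ids) -> float:
    """MRR: average over queries of 1/rank of the first relevant doc."""
    reciprocal_ranks = []
    for ranking, rel in zip(ranked_lists, relevant_ids):
        hits = np.where(ranking == rel)[0]
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))

# Simulate query vectors that land near their target documents, which is
# what training the audio encoder against relevance labels aims to achieve.
targets = rng.integers(0, n_docs, size=50)
queries = doc_embeddings[targets] + 0.1 * rng.standard_normal((50, dim))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

rankings = [retrieve(q, k=n_docs) for q in queries]
print(f"MRR: {mean_reciprocal_rank(rankings, targets):.3f}")
```

Because both encoders share one vector space, retrieval reduces to a nearest-neighbor lookup; a production system would serve this with an approximate-NN index rather than the exhaustive dot product used here.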