Can text be made to sound more than just its words? (2022) (arxiv.org)

🤖 AI Summary
This paper proposes making captions carry paralinguistic information by visually encoding vocal prosody into typography. The authors extract three prosodic features (loudness, pitch, and duration) and map them to font weight, baseline shift, and letter spacing, respectively, producing "speech-modulated typography" that can be rendered as static or animated text.

In a user study (n=117), participants were asked to match typographic renderings to their source audio among similar alternatives; they identified the correct audio 65% of the time on average, and performance did not differ significantly between animated and static renderings. Qualitative feedback revealed widely varying mental models of how typographic cues relate to vocal qualities.

For the AI/ML and accessibility communities, this work demonstrates a lightweight, explainable way to surface paralinguistic cues that conventional captions discard, useful for deaf and hard-of-hearing readers, richer conversational interfaces, and multimodal ASR/TTS systems. Technically, it suggests an end-to-end pipeline: prosody extraction from audio, parameter mapping into typographic features, and in-line rendering. The results are promising but modest, highlighting the need for standardized mappings, cross-cultural validation, training or legend affordances for readers, and integration with real-time captioning systems to evaluate practical benefit and robustness.
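That three-stage pipeline is straightforward to prototype. Below is a minimal sketch, assuming librosa for feature extraction and word-level timestamps from a forced aligner (hard-coded here for illustration); the hop length, pitch range, and the specific CSS mappings (font-weight for loudness, vertical-align as a baseline-shift stand-in for pitch, letter-spacing for duration) are placeholder choices, not the paper's published parameters.

```python
# Illustrative prosody-to-typography sketch: extract loudness/pitch/duration
# per word, normalize across the utterance, and emit inline-styled HTML.
# Assumes librosa is installed and word alignments come from a forced aligner.
import librosa
import numpy as np

HOP = 256  # analysis hop length in samples


def word_features(path, words):
    """Mean loudness, mean pitch, and duration for each aligned word.

    `words` is a list of (text, start_sec, end_sec) tuples.
    """
    y, sr = librosa.load(path, sr=None)
    rms = librosa.feature.rms(y=y, hop_length=HOP)[0]              # loudness proxy
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr, hop_length=HOP)  # pitch track
    n = min(len(rms), len(f0))                                     # align frame counts
    rms, f0 = rms[:n], f0[:n]
    times = librosa.times_like(rms, sr=sr, hop_length=HOP)

    feats = []
    for text, start, end in words:
        sel = (times >= start) & (times < end)
        feats.append({
            "text": text,
            "loudness": float(np.mean(rms[sel])) if sel.any() else 0.0,
            "pitch": float(np.nanmean(f0[sel])) if sel.any() else 0.0,
            "duration": end - start,
        })
    return feats


def normalize(values):
    """Min-max normalize a feature across the utterance to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]


def render_html(feats):
    """Map normalized features onto typography, one <span> per word."""
    loud = normalize([f["loudness"] for f in feats])
    pitch = normalize([f["pitch"] for f in feats])
    dur = normalize([f["duration"] for f in feats])

    spans = []
    for f, l, p, d in zip(feats, loud, pitch, dur):
        style = (
            f"font-weight:{int(300 + 500 * l)};"        # 300 (light) .. 800 (bold)
            f"vertical-align:{0.3 * p - 0.15:.2f}em;"   # shift up with higher pitch
            f"letter-spacing:{0.25 * d:.2f}em;"         # stretch longer words
        )
        spans.append(f'<span style="{style}">{f["text"]}</span>')
    return " ".join(spans)


if __name__ == "__main__":
    # Hypothetical clip and alignment, for illustration only.
    words = [("Can", 0.00, 0.22), ("text", 0.22, 0.55),
             ("sound", 0.55, 1.00), ("louder?", 1.00, 1.60)]
    print(render_html(word_features("clip.wav", words)))
```

Normalizing per utterance keeps the mapping relative, so a quiet speaker's loudest word still renders bold; a production captioning system would presumably calibrate against per-speaker baselines instead, one of the standardization questions the summary flags.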