🤖 AI Summary
Drax is a new non-autoregressive (NAR) ASR framework that uses discrete flow matching to enable efficient parallel decoding while retaining state-of-the-art recognition accuracy. Instead of the usual diffusion-style training that transitions from random noise to the target, Drax constructs an audio-conditioned probability path that guides the model along trajectories resembling the likely intermediate inference errors. This audio conditioning better aligns training and inference, so the model learns to correct realistic mistakes during parallel generation rather than coping with artificial noise schedules.
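The core idea — sampling an intermediate state along a probability path that keeps each target token with probability t and otherwise substitutes a plausible error — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `confusions` table standing in for the audio-conditioned error distribution is hypothetical, as is the `corrupt` helper itself.

```python
import random

def corrupt(target, t, confusions, rng):
    """Sample x_t along a discrete probability path at time t in [0, 1].

    Each position keeps its target token with probability t; otherwise it
    is replaced by a draw from that position's substitute distribution.
    In a plain masked/uniform path, substitutes would be a MASK token or a
    random token; the audio-conditioned variant described in the summary
    would instead draw acoustically confusable tokens, which `confusions`
    crudely stands in for here.
    """
    out = []
    for tok in target:
        if rng.random() < t:
            out.append(tok)  # token survives: closer to the clean target
        else:
            # token is corrupted: sample a realistic substitution error
            out.append(rng.choice(confusions.get(tok, [tok])))
    return out

# Toy usage: t near 1 yields mostly-correct sequences, t near 0 yields
# mostly-corrupted ones, mimicking the training trajectory.
rng = random.Random(0)
target = ["c", "a", "t"]
confusions = {"c": ["k"], "a": ["e"], "t": ["d"]}
print(corrupt(target, 0.5, confusions, rng))
```

Training the denoiser on states drawn this way, conditioned on the audio, is what lets the model practice correcting inference-like errors rather than arbitrary noise.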
Technically, the authors connect generalization gaps to divergences between training and inference occupancies and show these are controlled by cumulative velocity errors in the flow-matching dynamics, which motivates the audio-conditioned path design. Empirically, Drax matches the accuracy of top autoregressive speech models while offering a superior accuracy-efficiency trade-off, demonstrating that discrete flow matching is a viable direction for scalable, low-latency ASR. The approach suggests broader opportunities for applying diffusion/flow techniques from large language models to speech tasks, particularly where aligning training trajectories with real inference behavior reduces mismatch and improves parallel decoding performance.