🤖 AI Summary
NVIDIA and University of Maryland researchers released Music Flamingo, a 7B audio-language model and associated datasets designed to push music understanding beyond short captions into layered, theory-aware reasoning. To address the scarcity of richly annotated music data, they built MF-Skills (~2M full songs, 2.1M multi-paragraph captions averaging ~452 words, and ~0.9M QA pairs across 100+ genres) and MF-Think, a chain-of-thought corpus of music-theory reasoning traces. Music Flamingo ingests up to ~15 minutes of audio with a 24k-token context window, supports on-demand <think> traces, and is available via a public demo and checkpoints.
Key technical advances include extending the Audio Flamingo 3 backbone for a larger memory/receptive field, adding Rotary Time Embeddings so tokens carry absolute timestamps (improving localization of chord changes, solos, and lyric entries), supervising intermediate reasoning with MF-Think, and applying GRPO-based reinforcement learning with rewards that favor theory-correct explanations and accurate metadata (tempo/key/chords). The model achieves state-of-the-art results across more than 10 music benchmarks: large accuracy gains on MuChoMusic (74.6% vs. 52.1% prior), dramatic lyric WER drops (Opencpop 12.9% vs. ~54% prior), stronger captioning (MusicCaps 8.8 vs. 7.2), and improved instrument/genre recognition, while also demonstrating more faithful chord tracking, event localization, and culturally balanced analyses. Music Flamingo sets a practical foundation and benchmark for building models that reason about music as musicians do, not just label it.
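The Rotary Time Embedding idea lends itself to a short illustration. The sketch below is a minimal, assumption-laden version: it reuses standard RoPE inverse frequencies but drives the rotation angle with each audio token's absolute timestamp in seconds rather than its token index. The function name, base constant, and half-split convention are illustrative choices, not the paper's exact formulation.

```python
import torch

def rotary_time_embedding(x, timestamps, base=10000.0):
    """
    Apply a rotary embedding keyed to absolute timestamps (seconds)
    instead of token indices, so each audio token encodes *when* in
    the song it occurs.

    x:          (batch, seq, dim)  query or key vectors, dim even
    timestamps: (batch, seq)       absolute time of each token in seconds
    """
    b, s, d = x.shape
    half = d // 2
    # Standard RoPE inverse frequencies over half the feature dimension.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    # Rotation angles scale with wall-clock time, not token position.
    angles = timestamps.unsqueeze(-1) * inv_freq          # (b, s, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: 10 audio tokens covering the first ~9.5 seconds of a track.
q = torch.randn(1, 10, 64)
t = torch.arange(10, dtype=torch.float32).unsqueeze(0) * 0.95
q_rot = rotary_time_embedding(q, t)
```

Presumably the point of keying the rotation to seconds is that two tokens the same temporal distance apart get the same relative phase offset regardless of audio frame rate, which is what would let the model pin a chord change or lyric entry to an absolute point in a long song.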