🤖 AI Summary
Spotify describes a hybrid, production-ready approach to personalizing "agentic" music recommendation: LLM-based agents interpret situational queries (e.g., “solo night drive”), generate DSL orchestration plans that call search/filter tools, and synthesize playlists, then learn continuously from plays, skips, saves, and other signals. The core challenge addressed is credit assignment: users give feedback only on final playlists, not on the orchestration that produced them, while preferences and the catalog keep evolving. Rather than periodic batch retraining or brittle RL, Spotify pairs Direct Preference Optimization (DPO) with a calibrated, RLHF-inspired reward model that predicts long-term satisfaction for a given user/query/playlist triple, fine-tuning the agent end-to-end.
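To make the described pipeline concrete, here is a minimal Python sketch under invented names (`ToolCall`, `Plan`, `execute_plan`, and `RewardModel.score` are illustrative, not Spotify's actual DSL or APIs): an agent-generated plan is executed against a pool of tools to assemble a playlist, which a calibrated reward model then scores for a given user and query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative shapes only; the real DSL, tool pool, and reward model are
# Spotify-internal and not specified in the post.

@dataclass
class ToolCall:
    tool: str               # e.g. "search" or "filter"
    args: dict

@dataclass
class Plan:
    query: str              # situational query, e.g. "solo night drive"
    steps: List[ToolCall]   # orchestration steps expressed in the DSL

@dataclass
class Playlist:
    track_ids: List[str]

def execute_plan(plan: Plan,
                 tools: Dict[str, Callable[..., List[str]]]) -> Playlist:
    """Run each DSL step against the tool pool and assemble the playlist."""
    tracks: List[str] = []
    for step in plan.steps:
        tracks = tools[step.tool](tracks, **step.args)
    return Playlist(track_ids=tracks)

class RewardModel:
    """Calibrated model estimating long-term satisfaction (plays, saves,
    few skips) for a (user, query, playlist) triple; stubbed here."""

    def score(self, user_id: str, query: str, playlist: Playlist) -> float:
        raise NotImplementedError  # in production: a trained, calibrated predictor
```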
Technically, the system runs a preference-tuning flywheel: generate diverse, executable DSL plans, score the candidate playlists with the reward model, sample high-signal preference pairs using margin constraints and hard negatives, and fine-tune via DPO. Key practices include training on playlists (not plans), bucketing pairs by margin difficulty to avoid overfitting, using hard negatives to sharpen decision boundaries, and investing in infrastructure (tool-pool sizing, caching) for efficiency. In A/B tests, this produced meaningful gains (+4% listening time, more saves) and operational wins (70% fewer erroneous tool calls). The approach demonstrates a scalable, maintainable path to aligning agentic recommender systems with nuanced, intent-driven user preferences.
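As a rough sketch of the flywheel's pair-construction and fine-tuning steps, assuming invented margin thresholds and function names, the code below buckets reward-model-scored playlist candidates into (chosen, rejected) pairs by score margin and computes the standard DPO loss from policy and frozen-reference log-probabilities of the agent outputs behind each playlist; the preference label itself comes from the playlist-level reward score, in line with training on playlists rather than plans.

```python
import itertools
from typing import Dict, List, Tuple

import torch
import torch.nn.functional as F

def bucket_preference_pairs(
    scored: List[Tuple[str, float]],                          # (playlist_id, reward score)
    edges: Tuple[float, float, float] = (0.05, 0.15, 0.30),   # illustrative margin cutoffs
) -> Dict[str, List[Tuple[str, str]]]:
    """Form (chosen, rejected) pairs for one user/query and bucket them by
    score margin, so training can mix hard pairs (small margins, including
    hard negatives) with easier ones instead of overfitting to trivial wins."""
    buckets: Dict[str, List[Tuple[str, str]]] = {"hard": [], "medium": [], "easy": []}
    for (a, sa), (b, sb) in itertools.combinations(scored, 2):
        chosen, rejected = (a, b) if sa >= sb else (b, a)
        margin = abs(sa - sb)
        if margin < edges[0]:
            continue                      # near-tie: too noisy to label as a preference
        key = "hard" if margin < edges[1] else "medium" if margin < edges[2] else "easy"
        buckets[key].append((chosen, rejected))
    return buckets

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(chosen | query)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(rejected | query)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(chosen | query)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(rejected | query)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the chosen output
    over the rejected one, relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In practice, one would sample a mix of buckets per training batch so that hard pairs sharpen the decision boundary without dominating the update, which is the intent behind bucketing by margin difficulty.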