Evolutionary Data Making – How to train embedding models (wafer.systems)

🤖 AI Summary
Wafer describes an approach to training embedding models for the search functionality of its mobile operating system, aimed at the limitations of traditional static data generation methods. Drawing on evolutionary principles, Wafer built a dynamic system in which frontier large language models (LLMs) act as evolutionary operators: they explore and refine data generation policies, guided by quality metrics set out in the company's "Good Data Manifesto." A constitutional evaluation framework that guides data generation further improves the quality and relevance of the training data, which is intended to be more nuanced and contextually relevant than statically generated data. A 0.6-billion-parameter model trained on this data achieved a 37% improvement in NDCG@10 and performed well in blind comparisons on real user queries.

The work matters because embedding model development is currently opaque, with few shared practices or datasets, and the method could change how training datasets for retrieval models are constructed. By documenting its methodology openly, Wafer aims to encourage transparency and collaboration within the AI community. The evolutionary search framework also supports exploring complex associations across diverse data sources, such as emails, calendars, and messaging apps, so that users can find connections that would otherwise require extensive manual effort.
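For readers unfamiliar with the headline metric, here is a minimal sketch of how NDCG@10 is commonly computed. It is the textbook definition with linear gain, not Wafer's evaluation code. The function names are illustrative, and computing the ideal ranking from the retrieved list alone is a simplifying assumption. The summary does not say which NDCG variant was used or whether the 37% figure is relative or absolute.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # enumerate() is 0-based, so the result at position 1 is divided by log2(2) = 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the same relevance labels,
    i.e. every judged document for the query appears in `relevances`.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance labels for one query, in the order the model ranked them
# (made-up values for illustration only).
ranked_labels = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranked_labels, k=10)
print(f"NDCG@10 = {score:.4f}")
```

A score of 1.0 means the results are already in the best possible order. Averaging this per-query score over an evaluation set gives the kind of aggregate number quoted in the summary.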