Pipes: A Meta-Dataset of Machine Learning Pipelines (arxiv.org)

🤖 AI Summary
Researchers released PIPES, a purpose-built meta-dataset of machine-learning pipelines that aims to fill gaps left by repositories like OpenML. Where OpenML’s records often concentrate on a few popular preprocessing and modeling choices, PIPES systematically executes and records 9,408 distinct pipeline combinations across 300 datasets, explicitly covering the combinations of pipeline “blocks” such as scaling, imputation and model choices. For each experiment it stores the pipeline configuration, training and testing times, full predictions, performance metrics and error traces. The code and data are publicly available and can be extended. PIPES is significant because it provides a more balanced, comprehensive substrate for meta-learning, algorithm selection and AutoML research. By removing the sampling bias toward popular techniques, it lets researchers study how preprocessing choices interact with models, build more reliable meta-models of performance, and benchmark selection strategies across diverse pipelines. The dataset’s breadth and machine-readable traces also support robustness analyses (e.g., of failure modes), time-aware selection (using training and testing runtimes), and reproducible experimentation, making it a practical resource for improving model selection and pipeline-aware AutoML systems.
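To make the idea of exhaustively combining pipeline blocks concrete, here is a minimal Python sketch of how such a grid could be enumerated and how one record per experiment could be stored. It is not the PIPES code: the block names and options in `BLOCKS`, the helpers `enumerate_pipelines` and `run_experiment`, and the toy `toy_fit`/`toy_predict` functions are all hypothetical, and the preprocessing options are only labels here (no real scaling or imputation is performed). The record fields mirror what the summary says is stored: configuration, training and testing times, predictions, a metric, and an error trace.

```python
import itertools
import time
import traceback

# Hypothetical block options; the actual PIPES blocks and choices may differ.
BLOCKS = {
    "imputation": ["mean", "median"],
    "scaling": ["none", "standard", "minmax"],
    "model": ["majority_class", "always_fails"],
}


def enumerate_pipelines(blocks):
    """Yield one configuration dict per combination of block options."""
    names = list(blocks)
    for choice in itertools.product(*(blocks[name] for name in names)):
        yield dict(zip(names, choice))


def toy_fit(config, X, y):
    """Toy stand-in for training: returns the majority class, or fails on purpose."""
    if config["model"] == "always_fails":
        raise ValueError("simulated pipeline failure")
    return max(set(y), key=y.count)


def toy_predict(config, state, X):
    """Toy stand-in for inference: predicts the stored class for every row."""
    return [state for _ in X]


def run_experiment(config, train, test):
    """Run one pipeline and record timings, predictions, accuracy, and any error trace."""
    record = {"config": config, "train_time": None, "test_time": None,
              "predictions": None, "accuracy": None, "error": None}
    (X_train, y_train), (X_test, y_test) = train, test
    try:
        start = time.perf_counter()
        state = toy_fit(config, X_train, y_train)
        record["train_time"] = time.perf_counter() - start

        start = time.perf_counter()
        preds = toy_predict(config, state, X_test)
        record["test_time"] = time.perf_counter() - start

        record["predictions"] = preds
        record["accuracy"] = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
    except Exception:
        # Keep the failure as data instead of aborting the whole sweep.
        record["error"] = traceback.format_exc()
    return record


# Tiny made-up dataset, used only to exercise the loop.
train = ([[0.1], [0.2], [0.9]], [0, 0, 1])
test = ([[0.3], [0.8]], [0, 1])

records = [run_experiment(cfg, train, test) for cfg in enumerate_pipelines(BLOCKS)]
failed = [r for r in records if r["error"] is not None]
print(f"{len(records)} pipelines run, {len(failed)} failed")
```

Storing failures as records rather than dropping them is what makes failure-mode analysis possible later, and keeping the two timings separate allows a selection strategy to trade accuracy against runtime.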