Show HN: An Implementation of DataRater: Meta-Learned Dataset Curation (github.com)

🤖 AI Summary
This repo implements DataRater (Calian et al.), a meta-learning pipeline that learns per-sample quality scores, reweights training examples accordingly, and improves robustness to corrupted data. The system runs a two-level optimization: multiple task-specific "inner" models train on data reweighted by a learned DataRater, while an outer loop updates the DataRater based on the inner models' validation performance. Sample scores are converted to weights via a softmax; the implementation also supports a population of inner models, periodic model refresh, gradient clipping, and checkpointing.

The code provides abstract dataset and model interfaces (DataRaterDataset, nn.Module task models, and a DataRater scorer), utilities to inject on-the-fly corruptions, and a factory/CLI to run meta-training, with examples, hyperparameters, and a one-line MNIST script included. For practitioners the significance is twofold: it automates dataset curation by distinguishing high-value from low-value samples and can improve robustness with minimal changes to existing training loops, and it is reproducible and extensible: new datasets are added by subclassing and new models are registered in the factory.

In the MNIST corrupted-data experiments (8 inner models, with meta- and inner-step tuning), filtered training that drops the lowest-scoring 10% of samples by DataRater score slightly outperformed both the baseline and a random-drop control (0.9732 vs. 0.9708 mean accuracy). The main trade-offs are the extra compute required by meta-optimization and sensitivity to hyperparameters (inner/outer learning rates, meta_steps, num_inner_models), but the code and saved checkpoints support further tuning and transfer to other datasets.
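To make the reweighting step concrete, here is a minimal sketch of one inner-model update, assuming PyTorch, a DataRater module that emits one scalar score per example, and a classification task. The function name and signatures are illustrative, not the repo's actual API, and the outer loop (the meta-gradient through inner updates that trains the DataRater itself) is omitted.

```python
import torch
import torch.nn.functional as F

def weighted_inner_step(inner_model, datarater, optimizer, x, y):
    """One illustrative inner-model update: the DataRater scores the batch,
    scores become weights via a softmax over the batch, and the inner model
    minimizes the weighted loss. Names and shapes are assumptions."""
    scores = datarater(x).squeeze(-1)            # per-sample quality scores, shape (B,)
    weights = F.softmax(scores, dim=0)           # convert scores to normalized sample weights
    per_sample_loss = F.cross_entropy(inner_model(x), y, reduction="none")
    loss = (weights * per_sample_loss).sum()     # reweighted training objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```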
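The "drop the lowest 10%" filtering used in the MNIST experiment can be approximated with a trained scorer along these lines; again a sketch under assumed shapes, where filter_dataset, the (x, y) batch format, and drop_fraction are hypothetical names rather than the repo's interface.

```python
import torch

@torch.no_grad()
def filter_dataset(datarater, dataset, drop_fraction=0.10, batch_size=256):
    """Score every sample with a trained DataRater and keep only the indices
    whose score is at or above the drop_fraction quantile (illustrative only)."""
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=False)
    scores = torch.cat([datarater(x).squeeze(-1) for x, _ in loader])
    cutoff = torch.quantile(scores, drop_fraction)          # e.g. the 10th-percentile score
    keep = (scores >= cutoff).nonzero(as_tuple=True)[0]     # indices of retained samples
    return torch.utils.data.Subset(dataset, keep.tolist())
```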