Distributed pipelines for handling PBs of image data (www.amplifypartners.com)

🤖 AI Summary
Datology has productized years of research into distributed pipelines that let researchers and customers curate petabyte-scale image-text corpora (they operate on ~5–6 PB in total; a recent run used DataComp XL's 12.8B text-image pairs, ≈600 TB) to improve model quality, cost, and speed without changing model architectures. Their approach underscores a big shift in AI: with internet-scale pretraining data effectively finite, smarter data selection (deduplication, filtering, embedding, clustering, and scoring) can yield SOTA results (Datology reports faster, cheaper, smaller CLIP models) and make pretraining more reproducible and deployable across real customer environments.

Technically, the team built reusable PB-scale operators and a custom orchestrator (layered on Flyte and their data catalog) that run on heterogeneous Kubernetes clusters, from H100-rich labs to GPU-less customer clouds. A standout example is their exact-image deduplication: a naive O(N^2) all-pairs comparison is infeasible, so they compute compact hashes, co-locate matching hashes via a tuned shuffle/hash-join pattern, score conflicting text-image pairs to pick the best caption, and then perform a two-step index-and-filter join to avoid full-dataset shuffles or broadcasts. This engineering (novel join/indexing patterns, heavy Spark/Ray tuning, and productionized deployment) turns once-intractable curation steps into scalable primitives that materially change how much signal the community can squeeze from existing data.
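
As a rough illustration of that hash-then-join dedup pattern, here is a minimal PySpark sketch. The input paths, column names (image_id, image_bytes, caption_score), and the SHA-256 content hash are assumptions for the example; Datology's actual operators and their custom two-step index-and-filter join are not public, so a plain left-semi join stands in for that final step.

```python
# Minimal PySpark sketch of hash-based exact-image dedup (illustrative only).
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("exact-image-dedup-sketch").getOrCreate()

# Full corpus of text-image pairs; the image bytes make this table heavy to shuffle.
pairs = spark.read.parquet("s3://example-bucket/image_text_pairs/")  # hypothetical path

# Step 1: replace O(N^2) all-pairs comparison with a compact per-image content hash.
hashed = pairs.select(
    "image_id",
    "caption_score",
    F.sha2(F.col("image_bytes"), 256).alias("img_hash"),
)

# Step 2: shuffle only this lightweight (id, score, hash) projection so duplicate
# hashes land in the same partition, then keep the best-scoring caption per hash.
best_per_hash = (
    hashed.withColumn(
        "rank",
        F.row_number().over(
            Window.partitionBy("img_hash").orderBy(F.desc("caption_score"))
        ),
    )
    .filter(F.col("rank") == 1)
    .select("image_id")
)

# Step 3: filter the heavy table against the small keep-index. A left-semi join on
# the id stands in here for the custom index-and-filter join described above.
deduped = pairs.join(best_per_hash, on="image_id", how="left_semi")

deduped.write.parquet("s3://example-bucket/image_text_pairs_deduped/")
```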
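
On the orchestration side, a hedged flytekit sketch of how operators like the one above might be chained into a workflow on Kubernetes; the task names, resource requests, and URI-passing convention are purely illustrative assumptions, not Datology's actual DAG.

```python
# Illustrative flytekit sketch: curation operators as tasks in a workflow.
from flytekit import Resources, task, workflow


@task(requests=Resources(cpu="8", mem="32Gi"))
def dedup(dataset_uri: str) -> str:
    # Would launch the Spark dedup job sketched above and return the output URI.
    return dataset_uri + "_deduped"


@task(requests=Resources(gpu="1", mem="64Gi"))
def embed(dataset_uri: str) -> str:
    # Would compute image/text embeddings used for clustering and scoring.
    return dataset_uri + "_embedded"


@workflow
def curation_pipeline(dataset_uri: str) -> str:
    # Chain operators; each step reads and writes datasets by URI.
    return embed(dataset_uri=dedup(dataset_uri=dataset_uri))
```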