Parquet Content-Defined Chunking (huggingface.co)

🤖 AI Summary
Hugging Face has integrated Apache Arrow's Parquet Content-Defined Chunking (CDC) into its new Xet storage layer, significantly optimizing Parquet file storage and transfer on the Hugging Face Hub. The change enables chunk-level deduplication, so datasets can be uploaded and downloaded more efficiently: only modified data chunks are transferred rather than entire files. Given that Parquet files account for over 4 PB of data on the Hub, this substantially reduces storage costs and bandwidth usage, making large-scale data workflows more scalable and performant.

Technically, CDC improves deduplication by leveraging Parquet's columnar layout to identify byte-level similarities even when only small parts of the data change, overcoming the limitations of traditional file-level deduplication. The feature is available directly in PyArrow and Pandas through a simple API flag (sketched below) and works seamlessly with Hugging Face's content-addressable Xet storage, deduplicating data across different repositories as well.

Practical tests on large datasets like OpenOrca demonstrate that adding or removing columns, changing data types, or appending rows results in uploading only the incremental changes rather than full files, yielding substantial upload speedups and reduced transfer sizes. This unlocks efficient collaborative workflows and incremental dataset updates without the overhead of re-transmitting large files, a significant step forward for AI/ML practitioners dealing with massive, evolving datasets. By combining Parquet CDC with Xet, Hugging Face is setting a new standard for data versioning and management in the AI ecosystem.
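The summary notes that the feature is enabled through a simple API flag in PyArrow and Pandas. Below is a minimal sketch of what that looks like, assuming the writer option is named `use_content_defined_chunking` as in recent PyArrow releases; treat the exact parameter name and availability as something to verify against your installed version.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Small illustrative table (the OpenOrca-scale tests in the post are far larger).
df = pd.DataFrame({"id": range(1_000), "text": ["example"] * 1_000})
table = pa.Table.from_pandas(df)

# PyArrow: pass the CDC flag directly to the Parquet writer.
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)

# Pandas: extra keyword arguments are forwarded to the PyArrow engine.
df.to_parquet(
    "data_pandas.parquet",
    engine="pyarrow",
    use_content_defined_chunking=True,
)
```

Files written this way can then be pushed to the Hub as usual; per the summary, Xet's content-addressable storage only transfers the chunks that changed since the previous version, so re-uploading a lightly modified file moves only the incremental data.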