🤖 AI Summary
Sturdy Statistics converted large, sparse topic-embedding arrays in DuckDB into a SciPy-style sparse format (indices + float32 values, with dimensionality moved to table metadata) and added lightweight SQL wrappers (sparse_list_extract/select, dense_x_sparse_dot_product, sparse_to_dense) so sparse arrays can be used like native lists. They decided to store indices and values in separate columns to exploit DuckDB’s columnar layout (faster index-only filtering, smaller reads) and implemented conversion helpers in NumPy (indices cast to integer types, values to float32).
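The NumPy conversion described above (nonzero indices as an integer column, values as float32, dimensionality kept in table metadata) might look roughly like this. This is a minimal sketch, not the library's actual code; `to_sparse` and `to_dense` are hypothetical names, and the dimensionality that the post says lives in table metadata is passed explicitly here:

```python
# Sketch of a dense -> sparse conversion along the lines the post
# describes (hypothetical helper names, not Sturdy Statistics' API).
import numpy as np

def to_sparse(dense):
    """Return (indices, values) for the nonzero entries of `dense`."""
    dense = np.asarray(dense)
    indices = np.flatnonzero(dense).astype(np.int32)  # index column
    values = dense[indices].astype(np.float32)        # value column
    return indices, values

def to_dense(indices, values, dim):
    """Rebuild the dense vector; `dim` would come from table metadata."""
    out = np.zeros(dim, dtype=np.float32)
    out[indices] = values
    return out

vec = np.array([0.0, 0.0, 1.5, 0.0, 2.25, 0.0])
idx, vals = to_sparse(vec)
restored = to_dense(idx, vals, dim=len(vec))  # round-trips exactly
```

Storing `indices` and `values` as two separate parallel columns (rather than one struct column) is what lets a columnar engine scan and filter the index column alone without reading the values.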
On real production data this yielded dramatic storage wins: an average 52% reduction in DuckDB file sizes, with the largest datasets benefiting most. Synthetic tests had been misleading, because DuckDB's Snappy compression already handles long runs of zeros well. Performance was nuanced: search and retrieval were unchanged when operations used sparse-aware dot products, but naive dense conversions slowed some analytics 2–5x. By reworking analytics to use UNNEST on the compact sparse representation, they cut row-expansion factors from roughly 500x down to 3–8x and achieved 2–10x speedups on heavier workloads. The work highlights practical trade-offs and argues for native sparse-array support in columnar engines to reduce storage and accelerate sparse linear-algebra workloads in ML pipelines.
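The sparse-aware dot product that keeps search and retrieval fast can be illustrated in a few lines. This is my own sketch of the arithmetic a wrapper like the post's `dense_x_sparse_dot_product` would perform (the Python function name is hypothetical): only the stored nonzero positions are touched, so cost scales with the number of nonzeros rather than with the full embedding dimensionality.

```python
# Sparse-aware dot product sketch: gather the dense query's entries at
# the sparse vector's nonzero indices, then reduce. Cost is O(nnz),
# independent of the full dimensionality.
import numpy as np

def dense_x_sparse_dot(dense, indices, values):
    return float(np.dot(np.asarray(dense)[indices], values))

query = np.array([0.5, 1.0, 0.0, 2.0], dtype=np.float32)   # dense query
indices = np.array([1, 3], dtype=np.int32)                  # nonzero positions
values = np.array([4.0, 0.5], dtype=np.float32)             # nonzero values
score = dense_x_sparse_dot(query, indices, values)          # 1.0*4.0 + 2.0*0.5
```

The same idea explains the UNNEST rework: unnesting the sparse (index, value) pairs expands each row only to its nonzero count, whereas unnesting a densified list expands it to the full dimensionality, hence the drop in row-expansion factors.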