Vortex: An extensible, state of the art columnar file format (github.com)

🤖 AI Summary
Vortex is an open-source columnar file format and toolkit (a Linux Foundation LF AI & Data project, Apache-2.0 licensed) designed for high-performance data systems on object storage. The project claims large gains over modern Parquet (about 100× faster random-access reads, 10–20× faster scans, and 5× faster writes) at similar compression ratios, and it supports very wide tables through zero-copy, zero-parse metadata. The file format is now considered stable (backwards compatible from v0.36.0), and the project emphasizes neutral governance and broad ecosystem integrations (Arrow, DataFusion, DuckDB, Spark, Pandas, Polars; Iceberg coming). Technically, Vortex separates the logical schema from the physical layout and offers zero-copy compatibility with Apache Arrow, which avoids serialization overhead in analytics and ML pipelines. Its pluggable architecture exposes encoding, type, compression, and layout strategies, so new encodings can be added as extensions (RLE, dictionary, FastLanes, GPU-aware integer compression, FSST strings, adaptive floating-point schemes) and encodings can be cascaded, with one encoding nested inside another. It ships compute kernels that operate directly on encoded data, lazy-loaded summary statistics for query planning, and a Rust crate and CLI (vx) for inspection, and it recommends MiMalloc for allocation. For AI/ML teams this could mean faster feature extraction, lower I/O and CPU costs, and easier integration into existing Arrow-native stacks, potentially speeding up training and inference data pipelines at scale.
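To make "compute kernels that operate on encoded data" concrete, here is a minimal sketch in plain Python of run-length encoding (RLE), one of the encodings listed above. It does not use Vortex or any of its APIs; the function names (`rle_encode`, `rle_decode`, `rle_sum`, `rle_count_equal`) are hypothetical and chosen only for illustration. The point is that an aggregate can be computed from the runs alone, without first expanding the column.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def rle_sum(runs):
    """Sum the column from its runs: one multiply per run, no decoding."""
    return sum(value * length for value, length in runs)


def rle_count_equal(runs, target):
    """Count rows equal to `target` by adding up matching run lengths."""
    return sum(length for value, length in runs if value == target)


# Illustrative data only (not taken from the Vortex project).
column = [7, 7, 7, 7, 3, 3, 9, 9, 9]
runs = rle_encode(column)

print(runs)                      # [(7, 4), (3, 2), (9, 3)]
print(rle_sum(runs))             # 61, same as sum(column)
print(rle_count_equal(runs, 9))  # 3
```

The work done by `rle_sum` and `rle_count_equal` scales with the number of runs rather than the number of rows, which is the general reason operating on encoded data can save CPU time on repetitive columns. How Vortex's own kernels are implemented is not described in the summary above.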