F3: Open-source data file format for the future [pdf] (db.cs.cmu.edu)

0 points | 20 hours ago

🤖 AI Summary

Researchers from Tsinghua, CMU, and industry collaborators have released F3, an open-source, “future-proof” columnar file format designed for modern data and ML workloads. F3 files are self-describing: each file carries its data, its metadata, and the decoder implementations themselves, embedded as WebAssembly (Wasm) binaries. Because the decoders travel with the file, any reader can decode a new encoding regardless of its language or library version, without upgrading system libraries. This addresses the long-standing version-skew and extensibility problems that have hampered Parquet and ORC deployments.

Technically, F3 replaces monolithic row groups with separate I/O, encoding, and dictionary units, minimizes the metadata that must be read to access a subset of columns, and supports cascading compression and vectorized decoding. Files reserve space for future indexing structures, and the decoder API allows the same decoder to be compiled to both native and Wasm targets. Embedding the Wasm binaries costs only kilobytes of storage, and decoding through Wasm is modestly slower (≈10–30%) than native decoding; in the authors’ evaluations, overall throughput and space efficiency match state-of-the-art formats.

For AI/ML teams, F3 enables safe rollout of new encodings (e.g., for high-dimensional features, embeddings, or specialized compressors) across heterogeneous fleets without breaking interoperability, which in turn makes it easier to ship new storage and processing techniques.
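The embedded-decoder idea described in the summary can be illustrated with a small toy. The sketch below is not F3's actual layout or API: the footer format, the function names (`write_file`, `read_column`), and the `delta` encoding are all hypothetical, and Python source run through `exec` stands in for a Wasm module. Unlike Wasm, `exec` provides no sandboxing, so this shows only the dispatch logic: use a native decoder when the reader has one, otherwise fall back to the decoder shipped inside the file.

```python
import json
import struct

# Decoders this hypothetical reader was built with. It knows "plain" but not "delta".
NATIVE_DECODERS = {
    "plain": lambda raw: json.loads(raw.decode("utf-8")),
}

# Decoder source embedded in the file. Stand-in for a Wasm binary (assumption).
DELTA_DECODER_SOURCE = '''
import struct

def decode(raw):
    out, total = [], 0
    for (delta,) in struct.iter_unpack("<q", raw):
        total += delta
        out.append(total)
    return out
'''


def delta_encode(values):
    """Store each value as a signed 64-bit difference from the previous value."""
    out, prev = bytearray(), 0
    for v in values:
        out += struct.pack("<q", v - prev)
        prev = v
    return bytes(out)


def write_file(columns, embedded_decoders):
    """Toy layout: column payloads, decoder blobs, JSON footer, 4-byte footer length."""
    body = bytearray()
    footer = {"columns": {}, "decoders": {}}
    for name, (encoding, payload) in columns.items():
        footer["columns"][name] = {
            "encoding": encoding, "offset": len(body), "length": len(payload)}
        body += payload
    for encoding, source in embedded_decoders.items():
        blob = source.encode("utf-8")
        footer["decoders"][encoding] = {"offset": len(body), "length": len(blob)}
        body += blob
    footer_bytes = json.dumps(footer).encode("utf-8")
    return bytes(body) + footer_bytes + struct.pack("<I", len(footer_bytes))


def read_column(data, name):
    """Read one column, touching only the footer and that column's bytes."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4])
    col = footer["columns"][name]
    raw = data[col["offset"]:col["offset"] + col["length"]]

    decode = NATIVE_DECODERS.get(col["encoding"])
    if decode is None:
        # Unknown encoding: fall back to the decoder embedded in the file.
        entry = footer["decoders"][col["encoding"]]
        blob = data[entry["offset"]:entry["offset"] + entry["length"]]
        namespace = {}
        exec(blob.decode("utf-8"), namespace)  # NOT sandboxed; Wasm would be
        decode = namespace["decode"]
    return decode(raw)


data = write_file(
    {
        "id": ("delta", delta_encode([100, 101, 103, 106])),
        "name": ("plain", json.dumps(["a", "b"]).encode("utf-8")),
    },
    {"delta": DELTA_DECODER_SOURCE},
)
print(read_column(data, "id"))    # decoded via the embedded decoder
print(read_column(data, "name"))  # decoded via the native decoder
```

In this toy the reader never needs an upgrade to read the `delta` column, which mirrors the version-skew argument in the summary. The cost is the extra decoder bytes stored in the file and the slower fallback path.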
