3x-9x Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser in Rust (arrow.apache.org)

0 points | 6 days ago

🤖 AI Summary

The parquet crate in arrow-rs (v57.0.0) ships a custom Apache Thrift parser that decodes Parquet footer metadata 3×–9× faster than the previous implementation, which relied on a generated parser. The improvement comes from eliminating intermediate generated structures and heap allocations: the new parser decodes Thrift Compact bytes directly into the final in-memory representation, supports selective parsing and field skipping, and uses hand-optimized hot paths and careful allocation strategies. Benchmarks show consistent speedups for string and floating-point metadata and for wide tables with many columns, all without changing the Parquet file format. This matters because footer parsing is on the critical I/O path for Parquet readers, its cost scales linearly with the number of columns and row groups, and it can dominate latency on fast storage and in latency-sensitive workloads (interactive analytics, observability, and RAG pipelines feeding LLMs). Rather than attempt a disruptive format change (e.g., switching to FlatBuffers), the team optimized the existing decoding path and obtained large real-world gains. To keep the custom parser maintainable as parquet.thrift evolves, the team adopted a Rust macro-based system (in the style of serde's #[derive]) that maps Thrift definitions to annotated Rust structs, so most code can be generated while performance-critical structures keep hand-tuned parsers.
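
To make "decodes directly into the final representation" and "field skipping" concrete, here is a minimal sketch of a Thrift Compact reader in Rust. It is not the arrow-rs implementation: `Reader`, `FooterSummary`, and `decode_summary` are hypothetical names, the two field ids it extracts are chosen only for illustration, and maps are deliberately left unhandled. It shows the general technique of walking field headers, borrowing the wanted values straight from the input buffer, and skipping everything else without allocating.

```rust
/// Cursor over a Thrift Compact encoded buffer (hypothetical helper type).
struct Reader<'a> { buf: &'a [u8], pos: usize }

impl<'a> Reader<'a> {
    fn new(buf: &'a [u8]) -> Self { Reader { buf, pos: 0 } }

    fn byte(&mut self) -> Option<u8> {
        let b = *self.buf.get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    /// Unsigned LEB128 varint, as used by the compact protocol.
    fn varint(&mut self) -> Option<u64> {
        let (mut out, mut shift) = (0u64, 0u32);
        loop {
            let b = self.byte()?;
            out |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 { return Some(out); }
            shift += 7;
            if shift > 63 { return None; } // malformed: varint too long
        }
    }

    /// Zigzag-encoded signed integer (i16/i32/i64 all share this encoding).
    fn zigzag(&mut self) -> Option<i64> {
        let v = self.varint()?;
        Some(((v >> 1) as i64) ^ -((v & 1) as i64))
    }

    /// Length-prefixed binary/string, borrowed from the input (no copy).
    fn bytes(&mut self) -> Option<&'a [u8]> {
        let len = usize::try_from(self.varint()?).ok()?;
        let end = self.pos.checked_add(len)?;
        let s = self.buf.get(self.pos..end)?;
        self.pos = end;
        Some(s)
    }

    /// Field header: returns (type, id); type 0 marks the end of a struct.
    fn field(&mut self, last_id: i16) -> Option<(u8, i16)> {
        let h = self.byte()?;
        let ty = h & 0x0f;
        if ty == 0 { return Some((0, 0)); }
        let delta = i16::from(h >> 4);
        let id = if delta == 0 { self.zigzag()? as i16 } else { last_id.checked_add(delta)? };
        Some((ty, id))
    }

    /// Advance past a value of the given compact type without materializing it.
    fn skip(&mut self, ty: u8) -> Option<()> {
        match ty {
            1 | 2 => {} // booleans live in the field header itself
            3 => { self.byte()?; }
            4 | 5 | 6 => { self.varint()?; }
            7 => { // double: fixed 8 bytes
                let end = self.pos.checked_add(8)?;
                if end > self.buf.len() { return None; }
                self.pos = end;
            }
            8 => { self.bytes()?; }
            9 | 10 => { // list or set: size and element type share one header byte
                let h = self.byte()?;
                let n = if h >> 4 == 15 { self.varint()? } else { u64::from(h >> 4) };
                let elem = h & 0x0f;
                for _ in 0..n {
                    if elem == 1 || elem == 2 { self.byte()?; } else { self.skip(elem)?; }
                }
            }
            12 => { // nested struct: skip fields until the stop byte
                let mut last = 0;
                loop {
                    let (t, id) = self.field(last)?;
                    if t == 0 { break; }
                    self.skip(t)?;
                    last = id;
                }
            }
            _ => return None, // maps and unknown types: not handled in this sketch
        }
        Some(())
    }
}

/// Hypothetical "final representation": only the fields the caller asked for.
#[derive(Debug, Default, PartialEq)]
struct FooterSummary<'a> { num_rows: i64, created_by: Option<&'a str> }

/// Decode straight into `FooterSummary`, skipping every other field.
/// Field ids 3 (i64) and 6 (string) are assumptions made for this example.
fn decode_summary<'a>(buf: &'a [u8]) -> Option<FooterSummary<'a>> {
    let mut r = Reader::new(buf);
    let mut out = FooterSummary::default();
    let mut last = 0;
    loop {
        let (ty, id) = r.field(last)?;
        match (id, ty) {
            (_, 0) => return Some(out),
            (3, 6) => out.num_rows = r.zigzag()?,
            (6, 8) => out.created_by = Some(std::str::from_utf8(r.bytes()?).ok()?),
            _ => r.skip(ty)?,
        }
        last = id;
    }
}
```

Calling `decode_summary` on a buffer that also contains an i32 field and a list of structs returns only the two requested values; the list is walked byte by byte but never turned into Rust objects, which is the kind of saving the summary attributes to selective parsing.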
