Apache DataFusion 51.0.0 Released (datafusion.apache.org)

0 points 10 hours ago | visit original

🤖 AI Summary

Apache DataFusion 51.0.0 has been released with work from 128 contributors, bringing broad performance and usability improvements for data-centric and ML pipelines built on Rust and Apache Arrow. Execution is faster (ClickBench shows notable gains) thanks to optimized CASE expression evaluation (earlier short-circuiting, reuse of partial results, less scattering) and an upgrade to Arrow Rust 57.0.0, which parses Parquet footer metadata much faster. That is a big win for workloads with many small files or where startup latency matters. DataFusion now fetches the last 512 KB of a remote Parquet file by default (configurable via datafusion.execution.parquet.metadata_size_hint), which usually captures the footer metadata in a single request and avoids extra I/O.

New features improve ergonomics: Decimal32/Decimal64 support, the SQL pipe operator (|>), DESCRIBE on arbitrary queries (returning the query's output schema), and PostgreSQL-style named arguments in scalar, aggregate, and window functions.

Observability improves as well. datafusion-cli adds object-store I/O profiling (\object_store_profiling) that traces GET/HEAD/LIST requests and summarizes their durations and sizes, which helps when diagnosing slow remote scans and validating caches. EXPLAIN ANALYZE now exposes richer operator metrics (output_bytes, selectivity, reduction_factor, detailed aggregate and join timings), and a new datafusion.explain.analyze_level setting selects concise or verbose reports.

Together these changes make DataFusion faster, more transparent, and easier to integrate into ETL, analytics, and ML systems. Consult the upgrade guide and changelog for migration notes.
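To see why a 512 KB tail fetch usually needs only one request, here is a minimal Python sketch of the idea. It is not DataFusion's implementation. It relies only on the Parquet trailer layout (footer bytes, then a 4-byte little-endian footer length, then the magic bytes PAR1). The names CountingStore, make_fake_parquet, and read_footer are hypothetical, and the "file" is a synthetic byte string rather than a real Parquet file.

```python
import struct

MAGIC = b"PAR1"
DEFAULT_HINT = 512 * 1024  # the 512 KB default mentioned in the summary


def make_fake_parquet(data_len: int, footer_len: int) -> bytes:
    """Build bytes with Parquet's trailer layout (not a real, decodable file)."""
    data = b"\x00" * data_len
    footer = b"\x01" * footer_len
    return MAGIC + data + footer + struct.pack("<I", footer_len) + MAGIC


class CountingStore:
    """Hypothetical stand-in for a remote object store; counts range reads."""

    def __init__(self, blob: bytes):
        self.blob = blob
        self.requests = 0

    def get_range(self, start: int, end: int) -> bytes:
        self.requests += 1
        return self.blob[start:end]


def read_footer(store: CountingStore, file_size: int, hint: int = DEFAULT_HINT) -> bytes:
    """Return the raw footer bytes, using one request when the hint is large enough."""
    hint = max(hint, 8)  # always need the 8-byte trailer (length + magic)
    start = max(0, file_size - hint)
    tail = store.get_range(start, file_size)  # speculative tail fetch
    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    needed = footer_len + 8
    if needed > file_size:
        raise ValueError("footer length exceeds file size")
    if needed <= len(tail):
        # The hint covered the whole footer: no second request.
        return tail[len(tail) - needed:-8]
    # The hint was too small: fetch only the part of the footer we are missing.
    missing = store.get_range(file_size - needed, start)
    return (missing + tail)[:-8]
```

With a 100-byte footer, a 512-byte hint returns the footer after one range read, while a 64-byte hint forces a second read. That extra round trip is what a generous default hint is meant to avoid on high-latency object stores.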
