Delta Lake 4.0.0 (github.com)

0 points | 3 hours ago

🤖 AI Summary

Delta Lake 4.0.0 is a new major release built on Apache Spark 4.0. It adds features that speed up and harden lakehouse workflows for analytics and ML. Highlights include Delta Connect (preview), which makes Delta usable over Spark Connect's decoupled client-server model; preview support for catalog-managed tables (catalog-brokered commits and policy enforcement); general availability (GA) of the new Variant data type for semi-structured, schema-on-read workloads; and a preview of shredded variants (Parquet-based subfield storage with up to 20x read speedups). Type widening is now production-ready, and DROP FEATURE can remove table features instantly without truncating history.

Under the hood, the Kernel and Spark changes focus on performance, consistency, and integration. Kernel now reads and writes per-commit version checksums and log-compaction files (for faster snapshot construction), exposes clustered-table metadata, and supports writing file statistics to the Delta log (which enables aggressive file pruning and data skipping). Row-tracking writes, post-commit hooks for compaction and checksums, enhanced transaction APIs, and improved feature upgrade flows make Delta more robust for production ML pipelines.

Note: Delta Standalone and its dependent connectors are being retired in favor of Kernel-based integrations, and several new capabilities are still preview/RFC and should not be enabled in production until stabilized. Overall, 4.0 improves Delta's scalability, its handling of semi-structured data, and its support for remote usage, all of which benefit feature stores, model training, and large-scale data prep.
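To make the file-statistics point concrete, here is a minimal conceptual sketch of how per-file min/max statistics can let a reader skip files that cannot match a range predicate. It is plain Python and does not use any Delta Lake or Kernel API; the `FileStats` class, the `files_to_scan` helper, and the file names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column (not a Delta API)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper].

    A file is skipped when its statistics prove that no row in it can
    satisfy the predicate `lower <= value <= upper`.
    """
    return [
        f for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Assumed example data: three files with non-overlapping value ranges.
files = [
    FileStats("part-000.parquet", 1, 100),
    FileStats("part-001.parquet", 101, 200),
    FileStats("part-002.parquet", 201, 300),
]

# Predicate: value BETWEEN 150 AND 180
selected = files_to_scan(files, 150, 180)
print([f.path for f in selected])  # prints ['part-001.parquet']
```

In this toy case two of the three files are never opened. The benefit of this kind of pruning depends on how well the data layout keeps value ranges from overlapping across files.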
