Mutable atomic deletes with Parquet backed columnar tables on S3 (www.shayon.dev)

🤖 AI Summary
This post describes a practical technique for selective, atomic deletes inside Parquet files stored on S3 without downloading and rewriting the unchanged bytes. Instead of tombstones, visibility is controlled by a tiny manifest, updated via compare-and-swap (CAS), that points to immutable Parquet objects.

To delete rows inside a single row group, the client range-reads only the footer and the target row group, decodes and re-encodes that group (updating its stats and row counts), then uses S3 Multipart Upload with UploadPartCopy to assemble a new object server-side: copy(prefix) → upload(edited row group) → copy(suffix) → upload(new footer + PAR1), followed by CompleteMultipartUpload. Readers pin to a manifest version/ETag so scans see a coherent snapshot; the manifest CAS flips visibility atomically.

Key technical implications: data movement stays inside S3 (no egress for the copied bytes), and the Parquet layout stays intact because the footer offsets (rg_start, rg_end, footer_start) locate the splice points, with subsequent row-group offsets shifted by delta = new_rg_size - old_rg_size. Scan performance is preserved by packing files at ~128–256 MiB with ~8–16 MiB row groups.

Operational caveats: S3 MPU rules apply (non-last parts must be ≥ 5 MiB; CopySourceRange is inclusive on both ends), MPU ETags are version tokens rather than MD5 checksums, edits should assemble to a new key and abort failed MPUs, and the footer, page indexes, and column stats must be regenerated correctly. Garbage collection of orphaned versions and batching edits per row group are recommended to bound latency and cost.
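The range-read step needs nothing beyond plain ranged GETs. A minimal sketch in Python with boto3 (my choice of client; the post's own code may differ): read the 8-byte tail to locate the footer, then fetch only the footer bytes. Decoding the thrift FileMetaData to recover the target row group's byte range (rg_start, rg_end) is elided; in practice that would use a parquet-format thrift binding.

```python
import struct

import boto3

s3 = boto3.client("s3")

def read_footer(bucket: str, key: str) -> tuple[bytes, int, int]:
    """Range-read only the Parquet footer. The last 8 bytes of a
    Parquet file are a 4-byte little-endian footer length followed
    by the b"PAR1" magic."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    tail = s3.get_object(Bucket=bucket, Key=key, Range="bytes=-8")["Body"].read()
    footer_len = struct.unpack("<I", tail[:4])[0]
    assert tail[4:] == b"PAR1", "not a Parquet file"
    footer_start = size - 8 - footer_len
    # Inclusive HTTP range: just the thrift-encoded FileMetaData.
    footer = s3.get_object(
        Bucket=bucket, Key=key, Range=f"bytes={footer_start}-{size - 9}"
    )["Body"].read()
    return footer, footer_start, size
```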
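Re-encoding the row group changes its size, so every later row group's offsets in the rebuilt footer must shift by delta = new_rg_size - old_rg_size. A sketch over a hypothetical dict rendering of the metadata; the field names mirror the parquet-format ColumnMetaData struct, but real code would mutate the thrift objects directly:

```python
def shift_offsets(row_groups: list[dict], edited_idx: int, delta: int) -> None:
    """Shift the byte offsets of every row group after the edited one."""
    for rg in row_groups[edited_idx + 1:]:
        for col in rg["columns"]:
            col["data_page_offset"] += delta
            # Dictionary and index pages are optional per column chunk.
            if col.get("dictionary_page_offset") is not None:
                col["dictionary_page_offset"] += delta
            if col.get("index_page_offset") is not None:
                col["index_page_offset"] += delta
```

The offset/column indexes and any RowGroup file_offset field need the same treatment, which is part of why the post stresses regenerating the footer and stats correctly.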
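The splice itself is four Multipart Upload parts, two of them server-side copies. A minimal sketch, continuing with the client above; rg_start/rg_end are the inclusive byte range of the target row group, and new_rg/new_footer are the re-encoded bytes. This sketch ignores the ≥ 5 MiB minimum for non-last parts, which in practice can force merging a too-small prefix or suffix into an adjacent uploaded part.

```python
def assemble(bucket: str, src_key: str, dst_key: str,
             rg_start: int, rg_end: int, footer_start: int,
             new_rg: bytes, new_footer: bytes) -> str:
    """copy(prefix) -> upload(edited rg) -> copy(suffix) -> upload(footer)."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=dst_key)["UploadId"]
    parts: list[dict] = []

    def copy_part(n: int, first: int, last: int) -> None:
        # Server-side copy: the bytes never leave S3, so no egress.
        r = s3.upload_part_copy(
            Bucket=bucket, Key=dst_key, UploadId=upload_id, PartNumber=n,
            CopySource={"Bucket": bucket, "Key": src_key},
            CopySourceRange=f"bytes={first}-{last}",  # inclusive on both ends
        )
        parts.append({"ETag": r["CopyPartResult"]["ETag"], "PartNumber": n})

    def put_part(n: int, body: bytes) -> None:
        r = s3.upload_part(Bucket=bucket, Key=dst_key,
                           UploadId=upload_id, PartNumber=n, Body=body)
        parts.append({"ETag": r["ETag"], "PartNumber": n})

    try:
        copy_part(1, 0, rg_start - 1)                # untouched prefix
        put_part(2, new_rg)                          # re-encoded row group
        # Assumes the edited group is not the last one (non-empty suffix).
        copy_part(3, rg_end + 1, footer_start - 1)   # untouched suffix
        put_part(4, new_footer                       # footer + length + magic
                 + struct.pack("<I", len(new_footer)) + b"PAR1")
        s3.complete_multipart_upload(
            Bucket=bucket, Key=dst_key, UploadId=upload_id,
            MultipartUpload={"Parts": parts})
    except Exception:
        # Abandoned MPU parts are billed until the upload is aborted.
        s3.abort_multipart_upload(Bucket=bucket, Key=dst_key, UploadId=upload_id)
        raise
    return dst_key
```

The ETag returned by CompleteMultipartUpload is a hash-of-hashes with a part-count suffix, not an MD5 of the object, hence the advice to treat it purely as a version token.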
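Finally, the visibility flip. A minimal sketch, assuming S3's conditional writes (If-Match on PutObject, available since late 2024) and a hypothetical JSON manifest listing the visible file keys; on other object stores, any compare-and-swap primitive over the manifest works the same way:

```python
import json

from botocore.exceptions import ClientError

def cas_swap(bucket: str, manifest_key: str,
             old_key: str, new_key: str) -> bool:
    """Atomically repoint the manifest from old_key to new_key.
    Returns False if a concurrent writer won the race."""
    obj = s3.get_object(Bucket=bucket, Key=manifest_key)
    etag = obj["ETag"]
    manifest = json.loads(obj["Body"].read())
    manifest["files"] = [new_key if f == old_key else f
                         for f in manifest["files"]]
    try:
        s3.put_object(Bucket=bucket, Key=manifest_key,
                      Body=json.dumps(manifest).encode(),
                      IfMatch=etag)  # CAS: fails with 412 on a lost race
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # re-read the manifest and retry
        raise
```

Readers that pinned the previous ETag keep reading the old object until they refresh, which is what lets garbage collection of orphaned versions run as a deferred, separate concern.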