Show HN: Annotations to document and runtime-validate Pandas and Polars DataFrames (github.com)

0 points 23 hours ago

🤖 AI Summary

Daffy is a small Python library that adds lightweight runtime validation and self-documenting contracts to Pandas and Polars DataFrame pipelines through two decorators, df_in and df_out. Declaring the expected columns and types at function entry and exit makes Daffy act like type hints for tabular data: structural mismatches are caught early, pipeline documentation stays in sync with the code, and a clear error is raised when schemas diverge. It supports regex column-name matching (e.g., r/column_\d+/), configurable strictness around extra columns, and improved type annotations for IDEs and type checkers.

Beyond these cheap structural checks, Daffy offers optional row-level validation using Pydantic (>=2.4) for value-level guarantees, including batch validation and cross-field rules, with failure reports that indicate which rows failed and why. Column validation is designed to be essentially free; row validation is costlier but optimized. Benchmarks on an M1 Pro show ~770K rows/sec for simple validation and ~165K rows/sec for complex models (32 columns, missing values, cross-field logic).

The project also offers pyproject.toml configuration, integrated logging for schema inspection, a development guide, and installation via pip (MIT license). The tradeoff is explicit: use the near-zero-overhead column checks by default, and enable Pydantic-based row checks when stronger data correctness is required.
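To make the decorator idea in the summary concrete, here is a minimal standard-library sketch of entry/exit column checks with regex matching and a strict mode. It is not Daffy's implementation or API: the names columns_in, columns_out, and check_columns are hypothetical, and a plain dict of column name to list stands in for a DataFrame so the example runs without Pandas or Polars. Only the r/.../ notation is taken from the summary.

```python
import functools
import re


def _is_regex(spec):
    # Treat "r/.../" strings as regex patterns, following the notation in the summary.
    return spec.startswith("r/") and spec.endswith("/") and len(spec) > 3


def check_columns(table, expected, strict=False):
    """Raise ValueError if an expected column is missing (or extras exist when strict)."""
    names = list(table)
    matched, missing = set(), []
    for spec in expected:
        if _is_regex(spec):
            pattern = re.compile(spec[2:-1])
            hits = [n for n in names if pattern.fullmatch(n)]
        else:
            hits = [n for n in names if n == spec]
        if hits:
            matched.update(hits)
        else:
            missing.append(spec)
    if missing:
        raise ValueError(f"missing columns: {missing}; got {names}")
    extra = [n for n in names if n not in matched]
    if strict and extra:
        raise ValueError(f"unexpected columns: {extra}")


def columns_in(expected, strict=False):
    """Hypothetical decorator: validate the first positional argument on entry."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(table, *args, **kwargs):
            check_columns(table, expected, strict)
            return func(table, *args, **kwargs)
        return wrapper
    return decorator


def columns_out(expected, strict=False):
    """Hypothetical decorator: validate the return value on exit."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            check_columns(result, expected, strict)
            return result
        return wrapper
    return decorator


@columns_in(["brand", r"r/price_\d+/"])
@columns_out(["brand", "mean_price"], strict=True)
def add_mean_price(table):
    # Average all price_N columns row by row.
    price_cols = [n for n in table if n.startswith("price_")]
    means = [sum(vals) / len(vals) for vals in zip(*(table[c] for c in price_cols))]
    return {"brand": table["brand"], "mean_price": means}


cars = {"brand": ["a", "b"], "price_1": [10, 20], "price_2": [30, 40]}
result = add_mean_price(cars)
```

Calling add_mean_price with a table that lacks any price_N column raises ValueError before the function body runs, which is the "fail at the boundary" behavior the summary describes. The check only inspects column names, so its cost does not depend on the number of rows.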
