🤖 AI Summary
The post presents Part II of a hands‑on series on program synthesis, applying an enumerative bottom‑up search to automatically generate mathematical features for the Iris dataset. Rather than seeking an exact program that matches examples, the author frames symbolic regression as program synthesis: enumerate expressions from a small DSL (Col, Lit, UnaryOp, BinaryOp), expand them with unary transforms (sqrt, abs, log(x+1), exp, sin, cos, relu, signum), powers, and arithmetic, deduplicate and normalize the candidates, and pick the best approximating formulas to feed into a neural net. Key engineering decisions include handling noisy or incomplete targets (accepting approximately correct programs rather than exact matches, to avoid overfitting), prohibiting NaN/Inf results, and simplifying expressions via custom rewrite rules (not full e‑graphs) to control redundancy.
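A minimal sketch of what that DSL might look like in Haskell: the constructor names (Col, Lit, UnaryOp, BinaryOp) and the transform list come from the post, while the concrete field types and the `eval` helper are illustrative assumptions rather than the post's actual code.

```haskell
import qualified Data.Map.Strict as M

-- Unary transforms named in the post.
data UOp = Sqrt | Abs | Log1p | Exp | Sin | Cos | Relu | Signum
  deriving (Eq, Ord, Show)

-- Binary arithmetic; Pow covers the "powers" expansion.
data BOp = Add | Sub | Mul | Div | Pow
  deriving (Eq, Ord, Show)

-- The four-constructor DSL: column references, literals,
-- unary transforms, and binary operators.
data Expr
  = Col String
  | Lit Double
  | UnaryOp UOp Expr
  | BinaryOp BOp Expr Expr
  deriving (Eq, Ord, Show)

-- Evaluate an expression against one row (column name -> value).
-- Returning Nothing on NaN/Inf at every node mirrors the post's rule
-- that such results are prohibited.
eval :: M.Map String Double -> Expr -> Maybe Double
eval row = go
  where
    go (Col c)           = M.lookup c row
    go (Lit x)           = finite x
    go (UnaryOp op e)    = finite . applyU op =<< go e
    go (BinaryOp op a b) = do
      x <- go a
      y <- go b
      finite (applyB op x y)

    applyU Sqrt   = sqrt
    applyU Abs    = abs
    applyU Log1p  = \x -> log (x + 1)
    applyU Exp    = exp
    applyU Sin    = sin
    applyU Cos    = cos
    applyU Relu   = max 0
    applyU Signum = signum

    applyB Add = (+)
    applyB Sub = (-)
    applyB Mul = (*)
    applyB Div = (/)
    applyB Pow = (**)

    finite x
      | isNaN x || isInfinite x = Nothing
      | otherwise               = Just x
```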
To tame the combinatorial explosion, the prototype uses a greedy beam search: deduplicate the generated expressions, evaluate them on a training split, rank candidates by Pearson correlation (or a configurable loss), keep the top N (beamLength, e.g. 4) at each depth, and iterate up to a search-depth limit. The implementation is demonstrated in a Haskell dataframe/Hasktorch context (generatePrograms, deduplicate, pickTopN, beamSearch), showing a practical tradeoff between DSL expressiveness and search tractability. For the AI/ML community this illustrates a reproducible, interpretable route to automated feature engineering, useful for small‑data scientific problems or hybrid pipelines, while highlighting where more advanced tools (e‑graphs, genetic programming, learned oracles) would be needed for larger, more expressive syntheses.
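The search loop itself might look like the sketch below, reusing the Expr type above. The function names (generatePrograms, deduplicate, pickTopN, beamSearch) match those the summary mentions, but the signatures, the expansion rules, and the Pearson scorer are assumptions for illustration.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))
import qualified Data.Set as S

-- One layer of bottom-up enumeration: wrap every beam member in each
-- unary transform and combine beam members pairwise with arithmetic.
-- The seed beam would hold the raw Col expressions.
generatePrograms :: [Expr] -> [Expr]
generatePrograms beam =
  [ UnaryOp op e
  | e <- beam
  , op <- [Sqrt, Abs, Log1p, Exp, Sin, Cos, Relu, Signum] ]
  ++
  [ BinaryOp op a b
  | a <- beam, b <- beam
  , op <- [Add, Sub, Mul, Div, Pow] ]

-- Structural deduplication; the post additionally normalizes with
-- custom rewrite rules before comparing.
deduplicate :: [Expr] -> [Expr]
deduplicate = S.toList . S.fromList

-- Keep the n best-scoring candidates (higher is better).
pickTopN :: Int -> (Expr -> Double) -> [Expr] -> [Expr]
pickTopN n score = take n . sortOn (Down . score)

-- Expand, dedupe, prune to beamLength, repeat until the depth limit.
beamSearch :: Int -> Int -> (Expr -> Double) -> [Expr] -> [Expr]
beamSearch maxDepth beamLength score = go maxDepth
  where
    go 0 beam = beam
    go d beam =
      let expanded = deduplicate (beam ++ generatePrograms beam)
      in  go (d - 1) (pickTopN beamLength score expanded)

-- Pearson correlation against the target column, one way to implement
-- the ranking the summary describes (constant candidates would yield
-- NaN here and be filtered out in practice).
pearson :: [Double] -> [Double] -> Double
pearson xs ys = cov / (sx * sy)
  where
    n   = fromIntegral (length xs)
    mx  = sum xs / n
    my  = sum ys / n
    cov = sum (zipWith (\x y -> (x - mx) * (y - my)) xs ys)
    sx  = sqrt (sum [ d * d | x <- xs, let d = x - mx ])
    sy  = sqrt (sum [ d * d | y <- ys, let d = y - my ])
```

Under these assumed signatures, a call like `beamSearch 3 4 scoreFn seeds` runs three expansion rounds with a beam of four, where `scoreFn` and `seeds` stand in for whatever scorer and starting column expressions the caller supplies; the post's version additionally evaluates on a training split and supports a configurable loss.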