Show HN: CellARC: Measuring Intelligence with Cellular Automata (arxiv.org)

🤖 AI Summary
Researchers released CellARC, a new synthetic benchmark that measures abstraction and reasoning using multicolor 1D cellular automata. Each episode is serialized into 256 tokens and contains five support pairs plus a query, letting researchers run rapid experiments with small models. Crucially, CellARC exposes explicit knobs for alphabet size (k), neighborhood radius (r), rule family, Langton's lambda, query coverage, and cell entropy, enabling controlled difficulty and reproducible sampling. The authors publish 95k training episodes and two 1k test splits (interpolation and extrapolation) and evaluate a wide range of approaches (symbolic, recurrent, convolutional, transformer, recursive, and large closed LLMs), so the benchmark can probe how models infer new dynamical rules under tight compute and data budgets.

Key findings show that inexpensive architectures can be very competitive: a 10M-parameter vanilla transformer achieved 58.0%/32.4% per-token accuracy on the interpolation/extrapolation splits, outperforming recent recursive models (TRM, HRM). A large closed model (GPT-5 High) reached 62.3%/48.1% on selected tasks, and an ensemble that picks per episode between the transformer and the best symbolic method hit 65.4%/35.5%, highlighting neuro-symbolic complementarity.

By decoupling generalization from anthropomorphic priors and allowing unlimited, difficulty-controlled sampling, CellARC offers a precise, reproducible playground to study sample efficiency, compositional generalization, and rule induction across model families.
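To make the episode structure concrete, here is a minimal sketch (not the authors' released generator) of how a CellARC-style episode could be sampled: a random rule table for a multicolor 1D cellular automaton with alphabet size k and neighborhood radius r, its Langton lambda, and five support pairs plus a held-out query, all produced by the same hidden rule. Names like `sample_episode`, the grid width, and the boundary condition are illustrative assumptions; note that an explicit rule table has k^(2r+1) entries, so this only scales to small k and r.

```python
# Hypothetical sketch of a CellARC-style episode sampler; details are assumed,
# not taken from the paper's released code.
import random

def random_rule_table(k: int, r: int, rng: random.Random) -> dict:
    """Map every (2r+1)-cell neighborhood to a next state in {0..k-1}."""
    width = 2 * r + 1
    table = {}
    def fill(prefix):
        if len(prefix) == width:
            table[tuple(prefix)] = rng.randrange(k)
            return
        for s in range(k):
            fill(prefix + [s])
    fill([])
    return table

def langton_lambda(table: dict, quiescent: int = 0) -> float:
    """Langton's lambda: fraction of transitions NOT mapping to the quiescent state."""
    non_quiescent = sum(1 for v in table.values() if v != quiescent)
    return non_quiescent / len(table)

def step(state: list, table: dict, r: int) -> list:
    """One synchronous update with periodic (wrap-around) boundaries."""
    n = len(state)
    return [
        table[tuple(state[(i + d) % n] for d in range(-r, r + 1))]
        for i in range(n)
    ]

def sample_episode(k: int, r: int, n_cells: int = 24, n_support: int = 5,
                   seed: int = 0) -> dict:
    """Five (input, output) support pairs plus a query, all under one rule."""
    rng = random.Random(seed)
    table = random_rule_table(k, r, rng)
    pairs = []
    for _ in range(n_support + 1):  # the last pair becomes the held-out query
        x = [rng.randrange(k) for _ in range(n_cells)]
        pairs.append((x, step(x, table, r)))
    return {"supports": pairs[:-1], "query": pairs[-1],
            "lambda": langton_lambda(table)}

episode = sample_episode(k=4, r=1, seed=42)
print(f"lambda = {episode['lambda']:.3f}")
qx, qy = episode["query"]
print("query in :", qx)
print("query out:", qy)
```

A model is then scored per token on how much of the query's output row it predicts correctly, which is what the 58.0%/32.4% accuracy figures above refer to; fixing the seed and the (k, r, lambda) knobs is what makes difficulty-controlled, reproducible sampling possible.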