I got the highest score on ARC-AGI again swapping Python for English (jeremyberman.substack.com)

🤖 AI Summary
A researcher rebuilt their ARC-AGI solver around an Evolutionary Test-Time Compute loop that swaps generated Python programs for plain-English instructions and set new benchmarks: 79.6% on ARC v1 (at $8.42 per task, ~25× more cost-efficient than o3) and a new state-of-the-art 29.4% on ARC v2 (previous best 25%). The system uses Grok-4 to create up to 40 candidate natural-language transformation instructions per task (30 initial + 5 individual revisions + 5 pooled revisions), with subagent models applying and scoring those instructions on training examples to produce fitness values. High-scoring candidates are iteratively refined—individual revisions use ASCII diffs of predicted vs. ground-truth grids, while pooled revisions synthesize multiple parents—balancing exploration and focused repair within token and compute constraints.

The result is technically significant because it demonstrates that evolving natural-language “programs” at test time can capture nuanced pattern-recognition and contextual rules that brittle Python code struggled to express on ARC v2. The work highlights model limitations—“dead reasoning zones” and fused, domain-specific reasoning circuits—and argues for bringing general deductive reasoning into the training distribution (via approaches like RL and test-time evolution) as a path toward more transferable, AGI-like capabilities. For practitioners, it’s a practical, compute-efficient recipe for improving zero-shot generalization on abstract reasoning benchmarks and a conceptual nudge toward language-based, test-time program synthesis.
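To make the loop concrete, here is a minimal Python sketch of the evolutionary test-time compute procedure as the summary describes it. The `InstructionModel` protocol, the `apply_fn` callable standing in for the subagent models, the `ascii_diff` helper, and the 30/5/5 defaults are illustrative assumptions about how the pieces fit together, not the author's actual code; the real system drives these steps through Grok-4 and subagent models.

```python
# Illustrative sketch of the evolutionary test-time compute loop described in
# the post. All interfaces here are hypothetical stand-ins, not the author's code.
from dataclasses import dataclass
from typing import Callable, List, Protocol

Grid = List[List[int]]  # an ARC grid of color indices


class InstructionModel(Protocol):
    """Hypothetical interface to the instruction-generating model (Grok-4 in the post)."""
    def generate_instructions(self, task: dict) -> str: ...
    def revise_with_diff(self, task: dict, instructions: str, diffs: List[str]) -> str: ...
    def revise_from_pool(self, task: dict, parent_instructions: List[str]) -> str: ...


@dataclass
class Candidate:
    instructions: str      # plain-English transformation rule
    fitness: float = 0.0   # fraction of training examples reproduced exactly


def ascii_diff(predicted: Grid, expected: Grid) -> str:
    """Side-by-side ASCII rendering of predicted vs. ground-truth rows,
    the feedback signal used for individual revisions (rows beyond the
    shorter grid are ignored in this sketch)."""
    pred_rows = ["".join(map(str, row)) for row in predicted]
    exp_rows = ["".join(map(str, row)) for row in expected]
    width = max(len(r) for r in pred_rows + exp_rows)
    lines = []
    for p, e in zip(pred_rows, exp_rows):
        marker = "" if p == e else "  <-- mismatch"
        lines.append(f"{p.ljust(width)} | {e}{marker}")
    return "\n".join(lines)


def score(candidate: Candidate, task: dict, apply_fn: Callable[[str, Grid], Grid]) -> float:
    """Have a subagent apply the English instructions to each training input
    and compare the result with the ground-truth output grid."""
    hits = sum(
        apply_fn(candidate.instructions, ex["input"]) == ex["output"]
        for ex in task["train"]
    )
    return hits / len(task["train"])


def evolve(task: dict, model: InstructionModel, apply_fn: Callable[[str, Grid], Grid],
           n_initial: int = 30, n_individual: int = 5, n_pooled: int = 5) -> Candidate:
    # 1. Exploration: sample a diverse initial population of candidate rules.
    population = [Candidate(model.generate_instructions(task)) for _ in range(n_initial)]
    for c in population:
        c.fitness = score(c, task, apply_fn)

    # 2. Individual revisions: show each top parent an ASCII diff of where its
    #    predictions diverge from the ground truth and ask it to repair the rule.
    for parent in sorted(population, key=lambda c: c.fitness, reverse=True)[:n_individual]:
        diffs = [
            ascii_diff(apply_fn(parent.instructions, ex["input"]), ex["output"])
            for ex in task["train"]
        ]
        child = Candidate(model.revise_with_diff(task, parent.instructions, diffs))
        child.fitness = score(child, task, apply_fn)
        population.append(child)

    # 3. Pooled revisions: synthesize several high-fitness parents into a new
    #    candidate that combines their partial insights.
    for _ in range(n_pooled):
        top = sorted(population, key=lambda c: c.fitness, reverse=True)[:3]
        merged = Candidate(model.revise_from_pool(task, [p.instructions for p in top]))
        merged.fitness = score(merged, task, apply_fn)
        population.append(merged)

    # The fittest instructions are then applied to the held-out test input.
    return max(population, key=lambda c: c.fitness)
```

The key design point the sketch tries to capture is that the "program" being evolved is a string of English, so fitness comes from a model executing that string on the training pairs rather than from running code, while the ASCII diffs and pooled parents give the revision steps targeted, grid-level feedback.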