HRM Analysis by Arc Prize Organizers (arcprize.org)

🤖 AI Summary
Arc Prize organizers reproduced and stress-tested the viral Hierarchical Reasoning Model (HRM) from Guan Wang et al. and qualitatively confirmed the headline claim: a 27M-parameter, brain-inspired model that combines short iterative "thinking" bursts, paired H (slow planner) and L (fast worker) recurrent modules, an adaptive halting signal (ACT), and heavy task augmentation can reach strong ARC-AGI-1 performance. On the Semi-Private hold-out set their run scored 32% (vs. the paper's 41% claim on the public set) at reasonable runtime and cost; ARC-AGI-2 performance was negligible (2%). Rather than synthesizing explicit programs, HRM predicts transductively in embedding space, de-augments the outputs of augmented task variants, and majority-votes the results.

Their ablation study sharpens the picture. The hierarchical H/L architecture contributes only a small advantage: a vanilla transformer with the same parameter count comes within ~5pp. The outer iterative refinement loop is the dominant contributor (going from 1 to 2 outer loops yields +13pp, and up to 8 loops doubles public performance), and training with many refinement steps matters more than running many steps at inference. Heavy per-task training is another big factor: training only on the evaluation tasks still achieves ~31% pass@2, implying HRM's gains largely stem from test-time or zero-pretraining task-specific training (akin to Liao & Gu's "ARC-AGI without pretraining"). Augmentations help materially but show diminishing returns: roughly 300 of the paper's 1,000 augmentations suffice.

Implication: iterative refinement plus task-specific training and augmentation, not a novel representational architecture, appear to drive HRM's ARC success, raising questions about benchmarking, generalization, and ARC Prize submission rules.
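To make the dominant mechanism concrete, here is a minimal sketch of the nested H/L recurrence with an outer refinement loop and an ACT-style halt. All module names, cell choices, dimensions, and loop counts are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative sketch of HRM-style nested recurrence, not the real model."""

    def __init__(self, d_model=256, l_steps=4, max_outer_loops=8):
        super().__init__()
        self.h_cell = nn.GRUCell(d_model, d_model)  # slow "planner" state update
        self.l_cell = nn.GRUCell(d_model, d_model)  # fast "worker" state update
        self.halt_head = nn.Linear(d_model, 1)      # ACT-style halting signal
        self.readout = nn.Linear(d_model, d_model)  # prediction in embedding space
        self.l_steps = l_steps
        self.max_outer_loops = max_outer_loops

    def forward(self, x):
        h = torch.zeros_like(x)  # slow state
        l = torch.zeros_like(x)  # fast state
        pred = None
        for _ in range(self.max_outer_loops):   # outer refinement loop
            for _ in range(self.l_steps):       # L takes several fast steps...
                l = self.l_cell(x + h, l)       # ...conditioned on H's current plan
            h = self.h_cell(l, h)               # H updates once per outer loop
            pred = self.readout(h)              # refine the running prediction
            p_halt = torch.sigmoid(self.halt_head(h))
            if bool((p_halt > 0.5).all()):      # adaptive halting at inference
                break
        return pred
```

Per the ablations above, the benefit of this structure came mainly from training with many such outer refinement loops, not from running extra loops at inference.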
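The de-augment-and-vote step can be sketched similarly. The small dihedral augmentation set and the `model_predict` stand-in below are assumptions for illustration; HRM's real pipeline also permutes colors and uses on the order of 1,000 augmentations per task:

```python
from collections import Counter
import numpy as np

# (forward_transform, inverse_transform) pairs; a toy subset of HRM's augmentations
AUGS = [
    (lambda g: g,              lambda g: g),                # identity
    (np.rot90,                 lambda g: np.rot90(g, -1)),  # rotate / undo rotate
    (lambda g: np.rot90(g, 2), lambda g: np.rot90(g, -2)),
    (np.fliplr,                np.fliplr),                  # flips are self-inverse
    (np.flipud,                np.flipud),
]

def model_predict(grid):
    # Stand-in for the trained model's transductive output-grid prediction.
    return grid  # placeholder

def predict_with_voting(test_input):
    votes, grids = Counter(), {}
    for aug, deaug in AUGS:
        pred = np.asarray(model_predict(aug(test_input)))  # predict on augmented view
        pred = deaug(pred)                # map the output back to the original frame
        key = str(pred.tolist())          # hashable key for an integer grid
        votes[key] += 1
        grids[key] = pred
    # Most-voted de-augmented grid wins; taking the top two instead
    # would match a pass@2 submission format.
    best_key, _ = votes.most_common(1)[0]
    return grids[best_key]
```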