Benchmarking World-Model Learning (arxiv.org)

🤖 AI Summary
The authors introduce WorldTest, a new evaluation protocol for world-model learning that decouples reward-free exploration from a scored test phase in a related but different environment. This design pushes models to acquire general, reusable knowledge about environment dynamics rather than overfitting to next-frame prediction or to reward maximization in the same training environment. WorldTest is intentionally open-ended and representation-agnostic: models must support many downstream tasks they weren't explicitly trained for, and success is measured behaviorally by how well derived tests are solved, not by likelihood or in-domain returns. They instantiate WorldTest as AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks spanning masked-frame prediction, planning, and predicting changes to causal dynamics. The authors evaluated 517 human participants and three state-of-the-art models. Humans substantially outperformed the models, and increasing compute improved model performance in only some environments, revealing significant headroom for learning transferable world models. Key takeaways for the AI/ML community are that current benchmarks and objectives (e.g., next-frame prediction, same-environment rewards) can mask poor generalization, and that progress will require better reward-free exploration, richer test-time task suites, and evaluation frameworks that probe causal and planning competencies across environment shifts.
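The two-phase structure is easy to picture in code. Below is a minimal sketch of a WorldTest-style protocol, assuming hypothetical names (`GridEnv`, `RandomAgent`, `PlanningTask`, `run_worldtest`) and a toy grid world rather than the paper's actual AutumnBench environments or API: the agent first interacts with a base environment with no rewards, then is scored behaviorally on derived tasks in related but shifted environments.

```python
# Hedged sketch of a WorldTest-style two-phase evaluation.
# All names here are illustrative, not the paper's actual API.
import random
from typing import List


class GridEnv:
    """Toy grid world standing in for an interactive environment."""

    def __init__(self, size: int = 5, goal=(4, 4)):
        self.size = size
        self.goal = goal
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action: int):
        # Actions 0-3 move the agent; note that no reward is returned,
        # mirroring the reward-free exploration phase.
        dx, dy = [(0, 1), (0, -1), (1, 0), (-1, 0)][action % 4]
        x, y = self.pos
        self.pos = (min(max(x + dx, 0), self.size - 1),
                    min(max(y + dy, 0), self.size - 1))
        return self.pos


class RandomAgent:
    """Placeholder agent; a real agent would learn a world model here."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def act(self, obs):
        return self.rng.randrange(4)

    def observe(self, obs):
        pass  # a learning agent would update its world model from obs


class PlanningTask:
    """Derived test: reach the goal of a related-but-different environment."""

    def __init__(self, env: GridEnv, budget: int = 20):
        self.env = env
        self.budget = budget

    def evaluate(self, agent) -> float:
        obs = self.env.reset()
        for _ in range(self.budget):
            obs = self.env.step(agent.act(obs))
            if obs == self.env.goal:
                return 1.0  # behavioral success, not likelihood
        return 0.0


def run_worldtest(agent, explore_env: GridEnv, tasks: List[PlanningTask],
                  explore_steps: int = 200) -> float:
    # Phase 1: reward-free exploration in the base environment.
    obs = explore_env.reset()
    for _ in range(explore_steps):
        obs = explore_env.step(agent.act(obs))
        agent.observe(obs)
    # Phase 2: scored test tasks in related, shifted environments.
    return sum(t.evaluate(agent) for t in tasks) / len(tasks)


if __name__ == "__main__":
    agent = RandomAgent()
    base = GridEnv(size=5)
    # Test environments differ from the exploration environment (shifted goals).
    tests = [PlanningTask(GridEnv(size=5, goal=(2, 3))),
             PlanningTask(GridEnv(size=5, goal=(4, 1)))]
    print("mean task score:", run_worldtest(agent, base, tests))
```

The point of the sketch is the separation of concerns: the exploration loop never exposes a reward signal, and the score comes entirely from how well the agent handles tasks it was never trained on, which is the property the benchmark is designed to measure.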