Can LLMs Play Baba Is You? (meffmadd.github.io)

🤖 AI Summary
An experiment evaluated how well several large language models (LLMs) solve the puzzle game "Baba Is You," in which players win by manipulating the rules themselves. The researcher built an OpenCode-based agent, adapting existing tools and adding new ones so that the models could interact with the game's unusual structure. The evaluation showed a stark performance divide between closed and open frontier models: Gemini 3.1 Pro solved all levels, while GLM 5.1 performed well among open-weight models, solving 5 of 8 levels. For the AI/ML community, the study highlights how model capabilities differ on niche tasks and supports the idea that closed frontier models may perform better because they are trained on more specialized datasets. The write-up also covers technical details of the evaluation, including the use of A* pathfinding for problem-solving and a comparison of tool usage across models. Overall, the results suggest that LLM capabilities on complex tasks will keep evolving as new methodologies and benchmarks such as ARC AGI 3 are integrated.
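The summary mentions A* pathfinding without saying how it was wired into the agent. As a rough illustration only, here is a generic A* search on a small tile grid. The grid encoding (`#` for blocking tiles), the function name `astar`, and the sample `LEVEL` are assumptions made for this sketch, not details from the experiment; in the actual game, which tiles block movement depends on the rules currently in effect.

```python
import heapq

def astar(grid, start, goal):
    """Shortest 4-connected path from start to goal as a list of (row, col).

    Hypothetical helper for illustration: '#' marks a blocking tile.
    Returns None when the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # entries are (f, g, cell)
    came_from = {}
    best_g = {start: 0}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g.get(cell, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_g = g + 1
                if new_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = new_g
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None

# Assumed sample level: the avatar starts top-left, the target is bottom-right.
LEVEL = [
    "..#",
    "..#",
    "...",
]

if __name__ == "__main__":
    print(astar(LEVEL, (0, 0), (2, 2)))
```

A helper like this only answers "can the avatar reach this tile, and by which moves"; deciding which rules to change would still be left to the model.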