🤖 AI Summary
Researchers from CUHK (Shenzhen) released BesiegeField, a physics sandbox and benchmark for "agentic" LLMs that design compositional machines (e.g., trebuchets, catapults, cars). Agents generate parts and assembly plans, run simulated rollouts, and iteratively edit their designs, while inspector agents apply chain-of-thought-style spatial reasoning to check assemblies. Human-designed trebuchets still outperform LLM-built machines, and the authors show how removing a single component can collapse a mechanism, highlighting the fragility and combinatorial difficulty of compositional mechanical design.
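To make the workflow concrete, here is a minimal sketch of such a generate–simulate–inspect–edit loop. All names (propose_design, simulate, inspect_design, revise) and the design schema are illustrative assumptions, not the paper's actual API:

```python
import random

def propose_design(task: str) -> dict:
    """Designer agent: emit a machine as a parts list plus assembly plan.
    (Stubbed; a real agent would query an LLM here.)"""
    return {"parts": ["frame", "arm", "counterweight"],
            "joints": [("frame", "arm"), ("arm", "counterweight")]}

def simulate(design: dict) -> float:
    """Physics rollout: return the task score (e.g., projectile distance).
    (Stubbed; a real rollout would run the sandbox.)"""
    return random.uniform(0.0, 10.0)

def inspect_design(design: dict) -> list[str]:
    """Inspector agent: spatial-reasoning checks on the assembly."""
    issues = []
    if ("frame", "arm") not in design["joints"]:
        issues.append("arm is not attached to the frame")
    return issues

def revise(design: dict, issues: list[str], score: float) -> dict:
    """Editing agent: patch the design given feedback and the rollout score.
    (Stubbed; a real agent would emit targeted part/joint edits.)"""
    return design

design, best = propose_design("trebuchet"), 0.0
for step in range(5):  # iterative-editing workflow
    issues = inspect_design(design)
    score = simulate(design)
    best = max(best, score)
    design = revise(design, issues, score)
print(f"best score after editing loop: {best:.2f}")
```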
The paper evaluates many LLMs (Gemini 2.5 Pro, OpenAI o3, Qwen, Claude, and Llama variants) across single-agent, iterative-editing, and hierarchical design workflows, finding large variance and generally moderate spatial/physics reasoning. Importantly, adding reinforcement learning with verifiable rewards (RLVR) to Qwen2.5-14B improved both validity and performance: a cold-start stage plus RL raised the maximum catapult score from ~2.4 to ~7.1 and dramatically increased the best car launch distance (max score up to ~45.7). Takeaways for the AI/ML community: BesiegeField provides a reproducible testbed for stressing compositionality, hierarchical planning, and verifiable reward design. The results suggest progress, but also clear limits in current LLMs’ ability to reliably synthesize coordinated mechanical systems without stronger physics priors, goal-aware RL, and iterative/hierarchical agent structures.
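For the RLVR piece, the key property is that the reward can be recomputed deterministically from the model's output: a validity check plus a simulator rollout, with no learned reward model. Below is a toy sketch, assuming a JSON design format and a stubbed rollout (both assumptions, not the paper's actual scoring code):

```python
import json

def parse_design(llm_output: str) -> dict | None:
    """Validity check: the design must be well-formed JSON with parts/joints."""
    try:
        design = json.loads(llm_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(design, dict) or not design.get("parts") or "joints" not in design:
        return None
    return design

def rollout_score(design: dict) -> float:
    """Placeholder for a physics rollout; real code would launch the sandbox
    and measure e.g. projectile or car distance."""
    return float(len(design["parts"]))  # toy stand-in for the simulated score

def verifiable_reward(llm_output: str) -> float:
    """Reward = 0 for invalid designs, else the simulated task score.
    Both terms are recomputable from the output alone, hence 'verifiable'."""
    design = parse_design(llm_output)
    if design is None:
        return 0.0
    return rollout_score(design)

# toy usage
print(verifiable_reward('{"parts": ["frame", "arm"], "joints": []}'))  # 2.0
print(verifiable_reward("not json"))                                   # 0.0
```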
        