Our LLM-controlled office robot can't pass butter (andonlabs.com)

🤖 AI Summary
Researchers introduced Butter-Bench, an evaluation that hands state-of-the-art LLMs control of a simple robot vacuum (lidar + camera, Slack for messaging) to see whether they can orchestrate household delivery tasks: essentially, whether an LLM can “pass the butter.” The task was decomposed into six focused subtasks (Search for Package, Infer Butter Bag via visual cues, Notice Absence, Wait for Confirmed Pick Up, Multi-Step Spatial Path Planning with 4 m segments, and a timed End-to-End pass-the-butter run within 15 minutes). To isolate high-level reasoning, low-level control was abstracted away: the LLM issues high-level actions such as “navigate to coordinate” or “capture picture.”

Across trials, humans averaged 95% completion while the best LLM scored just 40% (top models: Gemini 2.5 Pro, Claude Opus 4.1, and GPT-5; Llama 4 Maverick lagged).

The findings matter because many labs propose LLMs as orchestrators paired with separate executors, yet Butter-Bench shows current SOTA LLMs lack spatial intelligence, robust long-horizon planning, and failure-recovery strategies. Models repeatedly made excessive movements, got disoriented (spinning in circles), and sometimes produced distracting, looped “existential” diagnostics under stress (battery and docking failures), revealing brittle behavior and questionable embodied guardrails.

The study implies that progress requires better spatial representations, hierarchical planning and repair, and tighter integration with reliable executors and safety layers before LLMs can reliably manage real-world robotic workflows.
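The orchestrator-plus-executor split described above can be sketched in a few lines. This is a minimal illustration only: the action names, signatures, and dispatch mechanism are assumptions for exposition, not Butter-Bench's actual interface.

```python
from dataclasses import dataclass, field

# Hypothetical high-level action emitted by an LLM orchestrator.
# (Names like "navigate_to" are illustrative, not the benchmark's real API.)
@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)

class Executor:
    """Stub low-level controller: in a real system these methods would drive
    the robot and its sensors; here they just return canned observations."""

    def navigate_to(self, x: float, y: float) -> str:
        return f"arrived at ({x}, {y})"

    def capture_picture(self) -> str:
        return "image_0001.jpg"

    def send_slack_message(self, text: str) -> str:
        return f"sent: {text}"

def run_step(executor: Executor, action: Action) -> str:
    """Dispatch one LLM-chosen high-level action to the executor and return
    the observation that would be appended to the model's context."""
    handler = getattr(executor, action.name)
    return handler(**action.args)

# Example: the kind of action/observation exchange an orchestration loop runs.
obs = run_step(Executor(), Action("navigate_to", {"x": 2.0, "y": 4.0}))
print(obs)  # → arrived at (2.0, 4.0)
```

The point of the abstraction is that the LLM never sees motor commands; it only chooses among a small vocabulary of named actions and reads back textual observations, which is exactly the layer the benchmark evaluates.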