🤖 AI Summary
An AI research group has released Vending-Bench 2, a long-horizon benchmark that evaluates models on running a simulated vending-machine business for one year and scores them by final bank balance, averaged across five runs. The current leaderboard is led by Gemini 3 Pro ($5,478.16), followed by Claude Sonnet 4.5 ($3,838.74), Grok 4 ($1,999.46), GPT-5.1 ($1,473.43), and Gemini 2.5 Pro ($573.64). The release also includes Vending-Bench Arena, the first multi-agent version, in which independent agents compete (and optionally trade) at the same location, producing price wars and other strategic interactions.
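A minimal sketch of the scoring rule as described: each model runs the one-year simulation several times, and the leaderboard number is the mean final bank balance. The function name and run values below are illustrative, not taken from the benchmark's actual code.

```python
from statistics import mean

def leaderboard_score(final_balances: list[float]) -> float:
    """Score a model by its mean final bank balance across runs.

    Vending-Bench 2 reports the average over five independent
    one-year simulations, per the release notes.
    """
    return mean(final_balances)

# Hypothetical final balances for one model across five runs (USD):
runs = [6102.33, 4890.10, 5501.77, 5910.02, 4986.58]
print(f"Leaderboard score: ${leaderboard_score(runs):,.2f}")
```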
Vending-Bench 2 adds realistic stresses (adversarial suppliers, negotiation dynamics, delayed deliveries, supplier bankruptcy, customer refunds) and streamlines scoring so agents know to optimize final balance. Qualitative findings show Gemini 3 Pro excels by maintaining consistent tool usage over long runs and by persistently sourcing low-cost suppliers rather than prematurely accepting bad deals; conversely, GPT-5.1 struggled with overtrust, prepaying or overpaying suppliers. The benchmark highlights what economic agents need technically: durable long-term coherence, reliable tool and memory management (notes and reminders), robust supply-chain planning, negotiation competence, and multi-agent strategy. As models are poised to take on real operational roles, Vending-Bench 2 offers a focused, practical metric for progress and failure modes in autonomous business management.
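The contrast between the two failure modes maps onto a familiar agent-loop shape: keep durable notes, compare supplier quotes, and refuse deals above a reserve price rather than accepting the first offer. A hedged sketch under those assumptions; all names, prices, and the reserve-price heuristic are illustrative, not the benchmark's implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentState:
    """State an economic agent must keep coherent over a simulated year:
    cash balance plus written-down notes (the memory/reminders the
    benchmark's qualitative findings emphasize)."""
    balance: float = 500.0
    notes: list[str] = field(default_factory=list)

def choose_supplier(quotes: dict[str, float],
                    reserve_price: float) -> Optional[tuple[str, float]]:
    """Pick the cheapest quote, but keep searching (return None) if even
    the best offer exceeds the reserve price, i.e. do not prematurely
    accept a bad deal."""
    supplier, price = min(quotes.items(), key=lambda kv: kv[1])
    return (supplier, price) if price <= reserve_price else None

# Hypothetical quotes from simulated suppliers (unit cost in USD):
quotes = {"AcmeSnacks": 1.40, "BulkCo": 0.95, "QuickVend": 1.10}
state = AgentState()
deal = choose_supplier(quotes, reserve_price=1.00)
if deal:
    supplier, unit_cost = deal
    state.notes.append(f"ordered from {supplier} at ${unit_cost:.2f}/unit; "
                       "pay on delivery, never prepay unverified suppliers")
else:
    state.notes.append("all quotes above reserve; request fresh quotes tomorrow")
print(state.notes[-1])
```

The reserve-price check is one way to encode the "don't overtrust" lesson: the agent commits money only when a quote clears a threshold it set in advance, and its notes record why.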