Clockbench AI benchmark: 89.1% humans vs. 13.3% top LLM (clockbench.ai)

🤖 AI Summary
ClockBench, a new visual reasoning benchmark, reveals a striking gap between humans and state-of-the-art large language models (LLMs) at reading analog clocks: humans answer with 89.1% accuracy, while the top model, Gemini 2.5 Pro, reaches only 13.3%. Although LLMs show strong performance across many reasoning and visual tasks, their results on ClockBench underscore how hard it is to combine precise visual reasoning with temporal understanding. The benchmark requires models to read clock faces, validate times, adjust for time zones, and manipulate clock hands, returning answers in a structured JSON format that pushes beyond text-based reasoning into multimodal comprehension. The results suggest current frontier models lack the visual-spatial reasoning needed to interpret analog clocks accurately, highlighting a notable blind spot in AI development. The authors speculate that scaling existing LLM paradigms alone may not close this gap, pointing to a potential need for architectures or training methods that better fuse visual data with temporal reasoning. By making its dataset and evaluation code publicly available, ClockBench gives the AI/ML community a tool to study and advance multimodal time reasoning, a fundamental skill with broad applications in AI perception and human-like understanding.
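To make the structured-JSON answer format concrete, here is a minimal sketch of how such an exact-match check might work. The field names (`hours`, `minutes`, `seconds`) and the scoring rule are hypothetical illustrations, not the actual ClockBench schema or evaluation code.

```python
import json

# Hypothetical example only: ClockBench's real JSON schema is not shown here.
# Assumes the model must report the displayed time as structured fields so
# that answers can be scored by exact match.
model_output = json.loads('{"hours": 7, "minutes": 42, "seconds": 15}')
ground_truth = {"hours": 7, "minutes": 41, "seconds": 15}

def is_correct(answer: dict, truth: dict) -> bool:
    """Exact match on every ground-truth field; no partial credit assumed."""
    return all(answer.get(key) == value for key, value in truth.items())

print(is_correct(model_output, ground_truth))  # False: minutes differ
```

Under an all-or-nothing rule like this, a model that misreads the minute hand by a single tick scores zero on that item, which is one plausible reason accuracy numbers on analog-clock reading can be so low.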