EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages (esolang-bench.vercel.app)

🤖 AI Summary
Researchers have introduced EsoLang-Bench, a benchmark designed to evaluate the genuine reasoning abilities of large language models (LLMs) using five esoteric programming languages: Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare. Unlike traditional benchmarks built on widely used languages such as Python, where models can achieve inflated accuracy thanks to abundant training data, EsoLang-Bench presents 80 programming problems across difficulty tiers in languages whose training data is roughly 5,000 to 100,000 times scarcer than Python's.

The findings reveal a stark performance gap: frontier models that score around 90% on conventional language tasks manage only 0% to 11% on these esoteric challenges, suggesting that high scores on mainstream benchmarks may not reflect general programming ability. All tested models failed on medium and hard problems, and no model produced valid Whitespace code, highlighting the difficulty posed by its invisible syntax.

The research also underscores the effectiveness of direct interpreter feedback: letting models iterate against execution results improves performance, though scores remain far below those on traditional benchmarks. A systematic analysis of error types in each language further emphasizes the distinct reasoning challenges LLMs face, offering useful insights for future model training and evaluation strategies.
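To make the interpreter-feedback idea concrete, here is a minimal sketch of a Brainfuck interpreter in Python. This is not EsoLang-Bench's actual harness (the article does not describe its implementation); it only illustrates the kind of executor a model's candidate program would be run against, with the output or error fed back for the next attempt.

```python
def run_bf(code: str, tape_len: int = 30000) -> str:
    """Execute a Brainfuck program and return its printed output."""
    tape = [0] * tape_len
    out = []

    # Precompute matching bracket positions so loops can jump in O(1).
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()  # raises IndexError on unbalanced ']'
            jumps[i], jumps[j] = j, i

    ptr = pc = 0
    while pc < len(code):
        c = code[pc]
        if c == '>':
            ptr += 1
        elif c == '<':
            ptr -= 1
        elif c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256  # 8-bit wrapping cells
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '.':
            out.append(chr(tape[ptr]))
        elif c == '[' and tape[ptr] == 0:
            pc = jumps[pc]   # skip the loop body
        elif c == ']' and tape[ptr] != 0:
            pc = jumps[pc]   # repeat the loop body
        pc += 1
    return ''.join(out)


# A tiny program: 6 iterations of +11 give 66, minus 1 is 65, i.e. 'A'.
print(run_bf('++++++[>+++++++++++<-]>-.'))  # → A
```

A feedback loop of the sort the summary mentions would run the model's generated program through `run_bf`, compare the result against the expected output, and return any mismatch or exception to the model as the next prompt.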