🤖 AI Summary
A new benchmark, RegexPSPACE, evaluates the reasoning capabilities of large language models (LLMs) and large reasoning models (LRMs) on PSPACE-complete regular-expression problems: equivalence decision (RegexEQ) and minimization (RegexMin). Unlike previous benchmarks that targeted problems of NP complexity, RegexPSPACE applies a more rigorous standard, probing the computational limits of these models, in particular the space complexity they can handle within finite context windows, and requiring extensive search-space exploration.
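To make the RegexEQ task concrete, here is a minimal illustrative sketch (not the paper's methodology): a bounded equivalence check that compares two regexes on every string up to a fixed length over a small alphabet. Such a check can refute equivalence by finding a distinguishing string, but cannot prove it; deciding true equivalence is the PSPACE-complete problem the benchmark targets. The function name and parameters are hypothetical.

```python
import re
from itertools import product

def bounded_equiv(r1: str, r2: str, alphabet: str = "ab", max_len: int = 6) -> bool:
    """Approximate RegexEQ: compare the full-match behavior of two regexes
    on every string over `alphabet` of length <= `max_len`.
    A bounded check like this can only refute equivalence, never prove it."""
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False  # found a string accepted by one regex but not the other
    return True

# (a|b)* and (a*b*)* denote the same language over {a, b}
print(bounded_equiv(r"(a|b)*", r"(a*b*)*"))  # True
# (ab)* and a*b* differ, e.g. on "aab"
print(bounded_equiv(r"(ab)*", r"a*b*"))      # False
```

A sound decision procedure would instead convert both regexes to DFAs and compare minimized automata, which is exactly where the double-exponential blow-up mentioned below arises.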
The dataset comprises over a million regex instances, constructed through double-exponential space exploration and sound filtering. Evaluations of six LLMs and five LRMs reveal common failure patterns such as verbosity and repetition, underscoring the need for a better understanding of, and improvements in, space-bounded computation. This work is the first empirical investigation into the spatial limitations of LLMs and LRMs, and it provides a structured framework with quantitative metrics for evaluating their advanced reasoning abilities.