Reasoning LLMs are wandering solution explorers (arxiv.org)

🤖 AI Summary
A new paper argues that while large language models (LLMs) can look like competent reasoners when using test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning, they actually behave as "wandering solution explorers" rather than systematic problem solvers. Through qualitative and quantitative analysis across multiple state‑of‑the‑art models, the authors formalize what systematic problem solving should entail and expose consistent failure modes: invalid or ungrounded intermediate steps, redundant or unfocused exploration of the solution space, and hallucinated or unfaithful conclusions. The net effect is that models may perform well on simple puzzles but degrade sharply as task complexity and the need for structured search increase.

This matters for the AI/ML community because current benchmarks and evaluations often judge only final answers, masking unreliable internal reasoning that undermines trust, verifiability, and composability for complex tasks. The paper recommends shifting evaluation toward the structure and fidelity of the reasoning process itself and developing metrics/tools that measure search coverage, step validity, and faithfulness. Practically, this points to research directions like integrating explicit search/control procedures, step-level verification or proof checking, better uncertainty modeling, and process-aware benchmarks to steer LLMs from wandering explorers toward repeatable, systematic solvers.
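To make the process-aware evaluation idea concrete, here is a minimal, hypothetical sketch (not from the paper) of scoring a reasoning trace on step validity, search coverage, and faithfulness rather than only the final answer. The `Step` and `ProcessReport` types, the `evaluate_trace` function, and all of its inputs are illustrative assumptions; in a real system the per-step `grounded` flags would come from a step-level verifier or proof checker, and the coverage figures from instrumenting the model's search over the problem's state space.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    claim: str        # intermediate conclusion asserted by the model
    grounded: bool    # whether the step follows from premises/prior steps (assumed given here)


@dataclass
class ProcessReport:
    step_validity: float   # fraction of intermediate steps that are valid
    coverage: float        # fraction of distinct candidate states actually explored
    faithful: bool         # final answer is supported by an all-valid chain of steps


def evaluate_trace(steps: List[Step],
                   explored_states: Set[str],
                   total_states: int,
                   final_answer_supported: bool) -> ProcessReport:
    """Score a reasoning trace on process quality, not just the final answer."""
    valid = sum(1 for s in steps if s.grounded)
    step_validity = valid / len(steps) if steps else 0.0
    coverage = len(explored_states) / total_states if total_states else 0.0
    return ProcessReport(
        step_validity=step_validity,
        coverage=coverage,
        faithful=final_answer_supported and step_validity == 1.0,
    )


# Example: a trace with one ungrounded step still reaches the right answer,
# but the process-level report exposes the unreliable reasoning.
trace = [
    Step("A implies B", grounded=True),
    Step("therefore C", grounded=False),
    Step("C gives answer 42", grounded=True),
]
report = evaluate_trace(trace, explored_states={"s0", "s1"}, total_states=8,
                        final_answer_supported=True)
print(report)  # step_validity ≈ 0.67, coverage = 0.25, faithful = False
```

The point of the sketch is only that an answer-level metric would score this trace as correct, while a process-level report surfaces the ungrounded step and poor coverage the paper warns about.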