🤖 AI Summary
A recent study challenges the assumption that Reinforcement Learning from Verifiable Rewards (RLVR) enhances the reasoning capabilities of language models. The research introduces two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), to evaluate how much a model's generated reasoning actually contributes to its answers. Experiments on the Qwen2.5 model series reveal that while RLVR improves overall task accuracy, it does not reliably improve CIR or SR. This raises questions about whether the reasoning these models produce actually drives their answers, suggesting that RLVR may not foster the faithful reasoning chains it was expected to.
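To make the metrics concrete, here is a minimal intervention-style sketch of how CIR and SR might be computed. The function names, the dataset keys, and the specific interventions (swapping in an unrelated reasoning chain for CIR; checking whether the reasoning pins down the gold answer for SR) are illustrative assumptions, not the paper's exact definitions.

```python
from typing import Callable

# Hypothetical sketch; `answer_from` stands in for a model call that maps
# (question, reasoning) to a final answer string. CIR is probed here by
# swapping the reasoning for an unrelated chain; SR by checking whether
# conditioning on the reasoning yields the correct answer. Both are
# assumed forms for illustration.

AnswerFn = Callable[[str, str], str]

def causally_important(answer_from: AnswerFn, question: str,
                       reasoning: str, unrelated_reasoning: str) -> bool:
    """Reasoning counts as causally important if replacing it with an
    unrelated chain changes the model's final answer."""
    return (answer_from(question, reasoning)
            != answer_from(question, unrelated_reasoning))

def sufficient(answer_from: AnswerFn, question: str,
               reasoning: str, gold_answer: str) -> bool:
    """Reasoning counts as sufficient if conditioning on it leads the
    model to the gold answer without further deliberation."""
    return answer_from(question, reasoning) == gold_answer

def metric_rates(answer_from: AnswerFn,
                 examples: list[dict]) -> tuple[float, float]:
    """Aggregate CIR and SR rates over a dataset of dicts with keys
    'question', 'reasoning', 'unrelated_reasoning', 'gold_answer'."""
    cir = sum(causally_important(answer_from, e["question"], e["reasoning"],
                                 e["unrelated_reasoning"]) for e in examples)
    sr = sum(sufficient(answer_from, e["question"], e["reasoning"],
                        e["gold_answer"]) for e in examples)
    n = max(len(examples), 1)
    return cir / n, sr / n
```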
Significantly, the study finds that a small amount of Supervised Fine-Tuning (SFT) before applying RLVR can improve CIR and SR. Even without SFT, adding auxiliary rewards based on CIR and SR can produce reasoning that is causally important and sufficient while matching RLVR's accuracy (see the sketch below). These findings indicate that adjustments to post-training procedures are needed to elicit reasoning that genuinely supports a model's answers, which could enable more reliable use of language models on complex reasoning tasks.
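A minimal sketch of what such reward shaping could look like, assuming the verifiable task reward is simply augmented with weighted CIR/SR bonuses; the additive form and the weights `lambda_cir` and `lambda_sr` are assumptions, not taken from the paper.

```python
def shaped_reward(correct: bool,
                  causally_important: bool,
                  sufficient: bool,
                  lambda_cir: float = 0.1,
                  lambda_sr: float = 0.1) -> float:
    """Verifiable task reward plus illustrative CIR/SR bonuses.

    The base reward is 1.0 for a correct answer, 0.0 otherwise; the
    auxiliary terms add small bonuses when the reasoning is causally
    important and sufficient. Weights are placeholders, not the paper's.
    """
    reward = 1.0 if correct else 0.0
    reward += lambda_cir * float(causally_important)
    reward += lambda_sr * float(sufficient)
    return reward
```

In a scheme like this, the weights would need tuning so the auxiliary terms shape the reasoning without overwhelming the verifiable task reward.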