LLMs Encode How Difficult Problems Are (arxiv.org)

🤖 AI Summary
Researchers analyzed whether large language models internally encode problem difficulty and how that signal interacts with post‑training RL. Training linear probes across layers and token positions on 60 models (evaluated on the math and coding subsets of Easy2HardBench), they find that human-labeled difficulty is strongly linearly decodable (Pearson ρ ≈ 0.88) and that decodability scales with model size. By contrast, difficulty estimates derived from model behavior are much weaker and do not scale well. Steering activations along the probe's difficulty direction toward "easier" reduces hallucinations and improves accuracy. During GRPO reinforcement learning on Qwen2.5‑Math‑1.5B, the human‑difficulty probe signal strengthens and correlates positively with test accuracy, while the model‑derived difficulty probe degrades and correlates negatively with performance.

The work implies that human annotations capture a stable, generalizable notion of problem difficulty that RL tends to amplify, whereas automated difficulty proxies based on model outputs can become misaligned as models improve. Practically, this suggests using human‑anchored difficulty signals for curriculum design, probing, and RL objectives to reduce hallucination and guide generalization. The authors release probe code and evaluation scripts to support replication and follow‑up studies.
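The probing setup described above lends itself to a compact illustration. Below is a minimal sketch, assuming activations have already been extracted at a fixed layer and token position and paired with human difficulty labels (e.g. from Easy2HardBench); the ridge regression, train/test split, and function name are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of a linear difficulty probe (not the paper's code).
# X: (n_problems, hidden_size) activations from one layer/token position;
# y: human difficulty labels for the same problems (e.g. Easy2HardBench).
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def fit_difficulty_probe(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Fit a linear probe from activations to difficulty; report held-out Pearson r."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = Ridge(alpha=alpha).fit(X_tr, y_tr)
    r, _ = pearsonr(probe.predict(X_te), y_te)
    return probe, r  # probe.coef_ is the candidate "difficulty direction"
```

Repeating this fit across layers and token positions, and across models of different sizes, is what yields the reported correlation and scaling trends.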
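To make the steering result concrete, here is a hedged sketch of intervening on a decoder layer's hidden states with a forward hook, shifting representations toward the "easier" end of a probe direction. The layer index, steering scale, sign convention, and the random placeholder direction are all assumptions for illustration; only the model name (Qwen2.5‑Math‑1.5B) comes from the summary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-1.5B"  # model used in the paper's RL experiments
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# The difficulty direction would come from a trained probe (e.g. probe.coef_ above);
# a random unit vector stands in here purely as a placeholder.
direction = torch.randn(model.config.hidden_size)
direction = (direction / direction.norm()).to(model.dtype)

def steer_toward_easier(module, inputs, output, scale=4.0):
    """Shift this layer's hidden states along -direction (assumed 'easier' side)."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - scale * direction.to(hidden.device)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

layer_idx = 16  # illustrative mid-depth layer, not a value from the paper
handle = model.model.layers[layer_idx].register_forward_hook(steer_toward_easier)

prompt = "Find the remainder when 7^100 is divided by 13."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```

In practice the sign and magnitude of the shift would be calibrated against the probe's convention for "easy" versus "hard" before measuring any effect on accuracy or hallucination.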