🤖 AI Summary
Researchers introduced DeepSeek-R1, a family of LLMs whose advanced chain-of-thought (CoT) reasoning emerged primarily through reinforcement learning (RL) rather than supervised CoT examples. Starting from DeepSeek-V3 Base and using Group Relative Policy Optimization (GRPO) with a rule-based reward that judged only final-answer correctness (no human reasoning traces and no SFT before RL), the team trained DeepSeek-R1-Zero with a simple prompt template that enforced a <think>…</think><answer>…</answer> structure. The model autonomously developed longer, reflective CoT behaviors (verification, dynamic strategy adaptation, “aha” moments) and dramatically improved on verifiable benchmarks: AIME pass@1 rose from 15.6% to 77.9% during RL and reached 86.7% with self-consistency decoding. Similar gains appeared in coding competitions and graduate-level STEM problems.
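A minimal sketch of the two ingredients described above, a rule-based reward that checks only template adherence and final-answer correctness, and GRPO's group-relative advantage computed without a learned critic. Function names, the exact-match answer check, and the reward values are illustrative assumptions, not the paper's implementation:

```python
import re
import statistics

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Score only verifiable outcomes: the <think>/<answer> template must be
    followed and the <answer> must match the reference. No grading of the
    reasoning trace itself."""
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", completion, re.S)
    if not m:
        return 0.0  # format not followed
    answer = m.group(2).strip()
    return 1.0 if answer == gold_answer.strip() else 0.0  # accuracy reward

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO idea: normalize each sampled completion's reward against its own
    group of samples for the same prompt, so no value/critic model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Toy example: a group of 4 sampled completions for one prompt, one correct.
group_rewards = [rule_based_reward(c, "42") for c in [
    "<think>...</think><answer>42</answer>",
    "<think>...</think><answer>41</answer>",
    "no tags at all",
    "<think>...</think><answer>7</answer>",
]]
print(grpo_advantages(group_rewards))  # correct sample gets the positive advantage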
This work is significant because it shows RL can incentivize emergent, non‑human-like reasoning strategies without costly human-annotated trajectories, offering a scalable alternative to supervised CoT. To address readability, language mixing, and broader capability gaps, the authors then built DeepSeek-R1 with a multi-stage pipeline (cold-start conversational data, RL, rejection sampling, supervised fine‑tuning, and a second RL alignment stage) and distilled smaller models for public release. Implications: RL can unlock sophisticated problem-solving behaviors in LLMs and produce transferable reasoning patterns for smaller models, though careful alignment and mixed training stages remain necessary to preserve generality and human-aligned responses.
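As a rough outline of how the multi-stage pipeline fits together, here is a compact sketch; the stage names and field values paraphrase the summary above and are not the authors' official configuration:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data_source: str
    objective: str

# Illustrative ordering of the DeepSeek-R1 pipeline described in the summary.
PIPELINE = [
    Stage("cold_start_sft", "curated long-CoT conversational examples", "supervised fine-tuning"),
    Stage("reasoning_rl", "verifiable math/code prompts with rule-based rewards", "GRPO"),
    Stage("rejection_sampling", "RL outputs filtered for correctness and readability", "dataset construction"),
    Stage("general_sft", "rejection-sampled reasoning plus general-capability data", "supervised fine-tuning"),
    Stage("alignment_rl", "helpfulness/harmlessness and reasoning prompts", "second RL stage"),
]

for i, stage in enumerate(PIPELINE, 1):
    print(f"{i}. {stage.name}: {stage.objective} on {stage.data_source}")
```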