🤖 AI Summary
Researchers introduced DeepSeek-R1, a family of LLMs whose advanced chain-of-thought (CoT) reasoning emerged primarily through reinforcement learning (RL) rather than supervised CoT examples. Starting from DeepSeek-V3 Base and using Group Relative Policy Optimization (GRPO) with a rule-based reward that judged only final-answer correctness (no human reasoning traces and no SFT before RL), the team trained DeepSeek-R1-Zero with a simple prompt template that enforced a <think>…</think><answer>…</answer> structure. The model autonomously developed longer, reflective CoT behaviors (verification, dynamic strategy adaptation, “aha” moments) and dramatically improved on verifiable benchmarks: AIME pass@1 rose from 15.6% to 77.9% during RL and reached 86.7% with self-consistency decoding. Similar gains appeared in coding competitions and graduate-level STEM problems.
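A minimal sketch of the two ingredients described above, a rule-based reward that checks only template adherence and final-answer correctness, and GRPO's group-relative advantage computed without a learned critic. Function names, the exact-match answer check, and the reward values are illustrative assumptions, not the paper's implementation:

```python
import re
import statistics

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Score only verifiable outcomes: the <think>/<answer> template must be
    followed and the <answer> must match the reference. No grading of the
    reasoning trace itself."""
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", completion, re.S)
    if not m:
        return 0.0  # format not followed
    answer = m.group(2).strip()
    return 1.0 if answer == gold_answer.strip() else 0.0  # accuracy reward

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO idea: normalize each sampled completion's reward against its own
    group of samples for the same prompt, so no value/critic model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Toy example: a group of 4 sampled completions for one prompt, one correct.
group_rewards = [rule_based_reward(c, "42") for c in [
    "<think>...</think><answer>42</answer>",
    "<think>...</think><answer>41</answer>",
    "no tags at all",
    "<think>...</think><answer>7</answer>",
]]
print(grpo_advantages(group_rewards))  # correct sample gets the positive advantage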
This work is significant because it shows RL can incentivize emergent, non‑human-like reasoning strategies without costly human-annotated trajectories, offering a scalable alternative to supervised CoT. To address readability, language mixing, and broader capability gaps, the authors then built DeepSeek-R1 with a multi-stage pipeline (cold-start conversational data, RL, rejection sampling, supervised fine‑tuning, and a second RL alignment stage) and distilled smaller models for public release. Implications: RL can unlock sophisticated problem-solving behaviors in LLMs and produce transferable reasoning patterns for smaller models, though careful alignment and mixed training stages remain necessary to preserve generality and human-aligned responses.
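As a rough outline of how the multi-stage pipeline fits together, here is a compact sketch; the stage names and field values paraphrase the summary above and are not the authors' official configuration:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data_source: str
    objective: str

# Illustrative ordering of the DeepSeek-R1 pipeline described in the summary.
PIPELINE = [
    Stage("cold_start_sft", "curated long-CoT conversational examples", "supervised fine-tuning"),
    Stage("reasoning_rl", "verifiable math/code prompts with rule-based rewards", "GRPO"),
    Stage("rejection_sampling", "RL outputs filtered for correctness and readability", "dataset construction"),
    Stage("general_sft", "rejection-sampled reasoning plus general-capability data", "supervised fine-tuning"),
    Stage("alignment_rl", "helpfulness/harmlessness and reasoning prompts", "second RL stage"),
]

for i, stage in enumerate(PIPELINE, 1):
    print(f"{i}. {stage.name}: {stage.objective} on {stage.data_source}")
```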