A vision researcher's guide to some RL stuff: PPO and GRPO (yugeten.github.io)

🤖 AI Summary
A vision researcher’s blog post unpacks Proximal Policy Optimization (PPO) and DeepSeek’s Group Relative Policy Optimization (GRPO) in the context of RLHF for LLMs, highlighting two bold moves from DeepSeek’s R1 work: (1) skipping supervised finetuning (SFT) and applying RL directly to the base model (R1-Zero), and (2) replacing PPO with GRPO to remove the separate critic/value network. Together these choices cut post-training compute and memory costs (DeepSeek reports roughly a 50% reduction from dropping the critic), allow more exploratory “self-evolution” of reasoning, and reduce SFT-induced biases, but they depend on starting from a very strong base model.

Technically, the post recaps RLHF: sample multiple responses per prompt, have humans rank them, train a reward model with the Bradley-Terry pairwise loss −log sigmoid(R(p_i) − R(p_j)), then use RL to maximize the learned reward. PPO is an actor-critic scheme: the policy πθ (the LLM itself), a frozen reward model Rφ that scores only complete responses, and a critic Vγ trained with an L2 loss to predict the final reward from partial token sequences, which is what makes Generalized Advantage Estimation (GAE) computable. GAE blends multi-step TD estimates and Monte-Carlo returns via a parameter λ to trade off bias against variance.

GRPO sidesteps the critic by optimizing relative scores within each group of sampled responses directly, simplifying training and roughly halving resource needs. This makes RLHF more practical at scale, while shifting reliance onto reward-model quality and base-model capability.
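The Bradley-Terry pairwise loss quoted above is the standard objective for fitting the reward model to human preference pairs. The snippet below is a minimal PyTorch sketch of that loss, not code from the post; the tensor names and toy batch are illustrative.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss -log sigmoid(R(p_i) - R(p_j)) over a batch of
    (preferred, rejected) response pairs scored by the reward model."""
    # logsigmoid is numerically more stable than log(sigmoid(...))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar reward-model scores for 4 preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
rejected = torch.tensor([0.8, -0.1, 1.5, -0.7])
print(bradley_terry_loss(chosen, rejected))  # single scalar loss
```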
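GAE is the part of PPO that needs the critic: it turns per-token value estimates plus a sequence-level reward into per-token advantages. The sketch below is again not from the post; γ, λ, and the toy trajectory are assumed for illustration, with λ=1 recovering Monte-Carlo returns and λ=0 recovering one-step TD.

```python
import torch

def gae_advantages(rewards: torch.Tensor,
                   values: torch.Tensor,
                   gamma: float = 1.0,
                   lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over one trajectory of T steps.

    rewards: per-step rewards r_t (for RLHF, often zero except at the last token)
    values:  critic estimates V(s_t) for t = 0..T (one extra bootstrap value)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae                          # lambda-weighted sum
        advantages[t] = gae
    return advantages

# Toy trajectory: reward only at the final token, as with a sequence-level reward model
r = torch.tensor([0.0, 0.0, 0.0, 1.0])
v = torch.tensor([0.2, 0.3, 0.5, 0.8, 0.0])  # includes terminal bootstrap value V=0
print(gae_advantages(r, v))
```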
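One way to read GRPO's "relative group scores" is that each sampled response's advantage is its reward standardized against the group sampled for the same prompt, with no critic involved. A minimal sketch under that assumption (the function and variable names are hypothetical):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Critic-free, group-relative advantages: for G responses to one prompt,
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Toy example: reward-model scores for G = 5 responses to the same prompt
rewards = torch.tensor([0.1, 0.7, 0.4, 0.9, 0.2])
print(grpo_advantages(rewards))  # positive for above-average responses, negative otherwise
```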