Is GRPO Broken? (www.neelsomaniblog.com)

🤖 AI Summary
The final post in a reinforcement-learning primer examines Direct Preference Optimization (DPO) and the newer Group Relative Policy Optimization (GRPO, introduced by DeepSeek in 2024), arguing that pairwise GRPO is mathematically the same as DPO and, in disguise, the same as a REINFORCE-style objective.

The author first clarifies how DPO turns pairwise human preferences into a latent reward model: assume a hidden reward r such that preference probabilities depend only on reward differences, choose a link function (DPO uses the logistic sigmoid, which corresponds to Gumbel-distributed noise on the latent rewards), solve for the closed-form optimal policy (a softmax-like reweighting of the reference policy with temperature β), and reduce the problem to supervised maximum likelihood over preference pairs; these steps are sketched below. A KL penalty keeps the updated policy close to a reference policy, practically like PPO's constraint but motivated heuristically (or as a prior) rather than by the distributional argument PPO uses.

Technically, GRPO in the pairwise case is identical to DPO after reparameterization: the DPO objective can be rewritten as a REINFORCE gradient with a synthetic reward. The group extension, which allows more than two responses by assigning any weights that sum to zero, is where the theory weakens; DPO's pairwise weights are fixed by the sigmoid structure, whereas GRPO generalizes them without an equally principled derivation. Practically, GRPO often works, and online GRPO mirrors PPO-style mean-centering of k sampled responses (see the second sketch below), but the post cautions that GRPO's theoretical foundations are shakier than those of DPO and PPO.
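As a sketch of the derivation the summary walks through, here are the standard DPO steps in conventional notation; the symbols (r, β, π_ref, y_w, y_l, Z) follow common DPO write-ups and are not quoted from the post itself.

```latex
\begin{align}
  % 1. Bradley--Terry link: preference probability depends only on the reward difference.
  P(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
  % 2. KL-regularized objective, keeping the policy near a reference policy \pi_{\mathrm{ref}}.
  \max_{\pi}\;& \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
      - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \\
  % 3. Closed-form optimum: softmax-like reweighting with temperature \beta.
  \pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big) \\
  % 4. Invert for r, substitute into the link; Z(x) cancels in the difference,
  %    leaving a supervised maximum-likelihood loss over preference pairs.
  \mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}\Big[\log \sigma\Big(
      \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
\end{align}
```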
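And a minimal PyTorch sketch of the group-relative centering and REINFORCE-style surrogate described above. The function names, the sequence-level KL estimate, and the β value are illustrative assumptions, not the post's or DeepSeek's exact implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Mean-center the k sampled rewards for one prompt. The resulting weights
    # sum to zero across the group, which is the "any zero-sum weights"
    # structure the post questions; many GRPO variants also divide by the
    # group standard deviation.
    return rewards - rewards.mean()

def reinforce_style_loss(seq_logprobs: torch.Tensor,
                         ref_seq_logprobs: torch.Tensor,
                         rewards: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    # REINFORCE-style surrogate: maximize advantage-weighted log-probability of
    # each sampled response, with a crude sample estimate of the KL toward the
    # reference policy as a penalty (a sketch, not the exact objective from the
    # post or the DeepSeek paper).
    advantages = group_relative_advantages(rewards)
    policy_term = (advantages * seq_logprobs).mean()
    kl_term = (seq_logprobs - ref_seq_logprobs).mean()
    return -(policy_term - beta * kl_term)

# Pairwise special case (k = 2): mean-centering gives weights +(r1 - r2)/2 and
# -(r1 - r2)/2, the setting the post compares to DPO.
seq_logprobs = torch.tensor([-12.3, -15.1], requires_grad=True)
ref_seq_logprobs = torch.tensor([-12.0, -14.8])
rewards = torch.tensor([1.0, 0.0])
loss = reinforce_style_loss(seq_logprobs, ref_seq_logprobs, rewards)
loss.backward()
print(loss.item(), seq_logprobs.grad)
```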