Intro to RL: Off-Policy Methods (www.neelsomaniblog.com)

🤖 AI Summary
This explainer walks through off-policy RL methods (DDPG, TD3, and SAC) and why they matter: unlike on-policy algorithms such as PPO, which discard data once the policy drifts, off-policy approaches reuse past experience via replay buffers and scale to many parallel actors without requiring each actor to track the latest policy closely.

The deterministic policy gradient (DPG) theorem is central: it expresses the policy gradient as an expectation over the state visitation distribution pθ(s) alone, with no expectation over actions, which makes deterministic continuous-control policies tractable and enables actor-critic learning from sampled transitions.

Key technical takeaways: DDPG trains an actor μθ and a critic Qφ from replayed transitions, with frozen target networks updated softly via a coefficient τ to stabilize learning, but it is famously brittle.

TD3 addresses DDPG's instability with three changes: double critics (take the minimum of two target Q-values to curb overestimation), delayed actor updates (update the policy less frequently than the critics), and target policy smoothing (add small clipped noise to target actions, which approximates a local expectation and penalizes sharp spikes in Q).

SAC departs philosophically: it maximizes expected reward plus policy entropy (maximum-entropy RL), yielding a Boltzmann-style soft policy, soft Bellman backups, double critics, and automatic tuning of the temperature α.

Together, these methods enable more sample-efficient, scalable continuous control, at the cost of added algorithmic complexity and hyperparameter sensitivity.
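To make the DPG idea concrete, here is a minimal sketch of a DDPG-style actor update, assuming simple MLP networks and a replayed batch of states; the names (actor, critic, actor_update) and hyperparameters are illustrative, not the post's code. Because the gradient is an expectation over states only, the actor can simply ascend the critic's estimate Qφ(s, μθ(s)).

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    """One deterministic-policy-gradient step on a batch of replayed states."""
    actions = actor(states)                                  # a = mu_theta(s)
    q_values = critic(torch.cat([states, actions], dim=-1))  # Q_phi(s, mu_theta(s))
    actor_loss = -q_values.mean()   # ascend the critic's value estimate
    actor_opt.zero_grad()
    actor_loss.backward()           # chain rule: grad_a Q * grad_theta mu
    actor_opt.step()
    return actor_loss.item()

# Example: a fake replay batch of 32 states (illustrative only)
actor_update(torch.randn(32, state_dim))
```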
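The TD3 fixes summarized above can be sketched as a target computation plus a soft target update. This is a minimal illustration under assumed names (actor_target, q1_target, q2_target) and hyperparameters, not the post's exact implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

state_dim, action_dim, max_action = 8, 2, 1.0
gamma, tau = 0.99, 0.005             # discount and soft-update coefficient
policy_noise, noise_clip = 0.2, 0.5  # target policy smoothing parameters

actor_target = mlp(state_dim, action_dim)
q1_target = mlp(state_dim + action_dim, 1)
q2_target = mlp(state_dim + action_dim, 1)

def td3_target(state, reward, next_state, done):
    """Compute the TD3 Bellman target for a batch of replayed transitions."""
    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action,
        # so a single sharp spike in Q cannot dominate the backup.
        noise = (torch.randn(state.shape[0], action_dim) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (torch.tanh(actor_target(next_state)) * max_action + noise).clamp(-max_action, max_action)

        # Double critics: take the minimum of the two target Q-values to
        # reduce overestimation bias.
        sa = torch.cat([next_state, next_action], dim=-1)
        target_q = torch.min(q1_target(sa), q2_target(sa))

        # Standard Bellman backup; (1 - done) zeroes the bootstrap at terminal states.
        return reward + gamma * (1.0 - done) * target_q

def soft_update(target_net, online_net):
    """Polyak-average online weights into the target network (the tau update)."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

The delayed-actor-update piece is simply calling the actor (and soft_update) every few critic steps rather than every step.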
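For SAC, the two distinctive pieces are the entropy bonus inside the Bellman target and the automatic temperature adjustment. The sketch below assumes a common target-entropy heuristic and illustrative names (soft_q_target, update_temperature); it is not the post's code.

```python
import torch

action_dim = 2
gamma = 0.99
target_entropy = -float(action_dim)      # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def soft_q_target(reward, done, next_log_prob, next_q_min):
    """Soft Bellman backup: the entropy bonus -alpha * log pi enters the target."""
    alpha = log_alpha.exp().detach()
    soft_value = next_q_min - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value

def update_temperature(log_prob):
    """Adjust alpha so the policy's entropy tracks the target entropy."""
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()
```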