What You Didn't Learn in Berkeley CS 188: Intro to RL (www.neelsomaniblog.com)

🤖 AI Summary
A new write-up, the first part of “What You Didn’t Learn in Berkeley CS 188,” reorganizes the standard CS 188 RL curriculum into a sharper ontology and pinpoints a notable gap: the absence of a clean “model-free policy iteration” analogue to model-based policy iteration. The author frames RL along two orthogonal axes, model-based vs. model-free and value-based vs. policy-based, then walks through why value iteration and policy iteration are well behaved when you know the MDP (P, R) but break down in the model-free setting. This matters for practitioners because the pedagogical fragments in undergraduate courses don’t naturally bridge to the stochastic, sample-driven algorithms used in real continuous-control problems.

Technically, the piece revisits the MDP formalism and shows that the Bellman backup is a contraction, so the Banach fixed-point theorem guarantees that value iteration converges. It contrasts this with model-free learning, where Q(s, a) must be estimated from noisy samples: Q-learning copes with nonstationary targets via an exponential-moving-average (TD) update, and its convergence proofs rest on stochastic-approximation arguments that map the updates to ODEs.

Exact policy evaluation is impossible without a model, so “policy iteration” becomes approximate policy iteration: run only k evaluation steps between improvement steps. With k = 1 you recover SARSA (epsilon-greedy TD(0)), and more general n-step TD and eligibility-trace methods interpolate via multi-step returns. The post concludes by noting tabular methods’ limits (scaling, function approximation, continuous actions) and teases a follow-up on extending these ideas to modern continuous-control methods.
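To make the model-based half concrete, here is a minimal sketch of tabular value iteration assuming the transition tensor P[s, a, s'] and expected rewards R[s, a] are known; the function name, array shapes, and tolerance are illustrative choices, not details taken from the post. The sup-norm contraction property of the Bellman backup is what justifies the simple stopping test.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration on a fully known MDP.

    P: array of shape (S, A, S) with transition probabilities P[s, a, s'].
    R: array of shape (S, A) with expected immediate rewards.
    Returns the optimal state values V and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * (P @ V)          # shape (S, A)
        V_new = Q.max(axis=1)
        # The backup is a gamma-contraction in the sup norm, so iterates converge
        # geometrically to the unique fixed point (Banach fixed-point theorem).
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```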
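For the model-free half, a sketch of tabular TD(0) control with a flag that toggles between SARSA (bootstrap on the next action actually taken, i.e. on-policy) and Q-learning (bootstrap on the greedy action). The Gym-style environment interface, hyperparameters, and helper names are assumptions made for illustration, not details from the post.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def td_control(env, num_episodes=500, alpha=0.1, gamma=0.99, eps=0.1,
               on_policy=True, seed=0):
    """Tabular TD(0) control on a discrete Gym-style environment.

    on_policy=True  -> SARSA      (target bootstraps on the sampled next action)
    on_policy=False -> Q-learning (target bootstraps on the greedy next action)
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = epsilon_greedy(Q, s2, eps, rng)
            # TD target: sampled reward plus discounted bootstrap value.
            bootstrap = Q[s2, a2] if on_policy else Q[s2].max()
            target = r + (0.0 if terminated else gamma * bootstrap)
            # Exponential moving average toward a noisy, nonstationary target.
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

The `alpha * (target - Q[s, a])` step is the exponential moving average mentioned above: each update blends the old estimate with a sampled one-step target, which is exactly where the stochastic-approximation convergence arguments come in.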