🤖 AI Summary
A recent post introduces an information-theoretic framework, "information bandwidth" defined as B = I(S; π*), to quantify how much learning signal RL algorithms extract per episode when fine-tuning LLMs. Under two minimal assumptions (a unique optimal policy and finite signal resolution), the author shows that policy gradient (REINFORCE) is fundamentally bottlenecked: compressing a whole trajectory into a single scalar return G caps the information at log2 of the number of distinguishable return values per episode, so binary feedback yields at most 1 bit/episode. This formalizes why training needs thousands of episodes, why LoRA's small parameter budgets work well (the post estimates LoRA provides roughly 300–500× more capacity than the policy-gradient ceiling), and why simply adding model parameters won't fix sample inefficiency when feedback is sparse.
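A minimal sketch of the scalar-return ceiling, assuming the bound takes the form log2(number of distinguishable return values); the 10,000-bit target below is purely illustrative and not a figure from the post:

```python
import math

def scalar_return_ceiling(num_return_levels: int) -> float:
    """Upper bound on bits/episode when a whole trajectory is compressed
    into one scalar return with `num_return_levels` distinguishable values
    (the finite-signal-resolution assumption)."""
    return math.log2(num_return_levels)

def min_episodes(target_bits: float, bits_per_episode: float) -> float:
    """Lower bound on episodes needed to convey `target_bits` of
    information about the optimal policy at the given per-episode rate."""
    return target_bits / bits_per_episode

# Binary pass/fail feedback: at most 1 bit per episode.
per_episode = scalar_return_ceiling(2)                  # 1.0 bit/episode
# Hypothetical: if pinning down the optimal policy requires ~10,000 bits,
# sparse binary feedback forces at least ~10,000 episodes.
print(per_episode, min_episodes(10_000, per_episode))   # 1.0 10000.0
```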
By contrast, actor-critic methods can in principle deliver dense per-token feedback via TD errors: treating each timestep's δ_t as a separate signal yields an upper bound of T·log2(B_δ) bits/episode, where B_δ is the number of distinguishable TD-error values (e.g., T ≈ 1000 and 8-bit TD errors give ≤ 8000 bits/episode), because the critic bootstraps accumulated knowledge into per-step "surprise" signals. Important caveats: the actor-critic bound assumes the TD errors are independent and represents a theoretical ceiling; bootstrap-induced correlations violate that assumption in practice, so the actually achievable bandwidth is unknown. The takeaway for the AI/ML community is clear: sample efficiency is often limited more by signal density and information flow than by model capacity, and enriching the feedback signal (not just adding parameters) is the path to much larger gains.
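A sketch of the actor-critic ceiling under the post's stated (and admittedly unrealistic) independence assumption; the T = 1000 and 8-bit figures mirror the example above:

```python
import math

def actor_critic_ceiling(T: int, td_error_bits: int) -> float:
    """Theoretical ceiling of T * log2(B_delta) bits/episode, treating each
    per-timestep TD error delta_t as an independent signal with
    2**td_error_bits distinguishable levels. Bootstrap-induced correlations
    between TD errors break the independence assumption in practice, so this
    is an upper bound, not an achievable rate."""
    B_delta = 2 ** td_error_bits
    return T * math.log2(B_delta)

scalar_bound = math.log2(2)                                    # binary return: 1 bit/episode
dense_bound = actor_critic_ceiling(T=1000, td_error_bits=8)    # 8000 bits/episode
print(dense_bound, dense_bound / scalar_bound)                 # 8000.0 8000.0
```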