Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning (arxiviq.substack.com)

🤖 AI Summary
Researchers demonstrated the first successful scaling of Evolution Strategies (ES) to full-parameter fine-tuning of multi-billion-parameter LLMs, bypassing backpropagation entirely. They revive a simplified Natural Evolution Strategies (NES) method that perturbs a pretrained model θ with Gaussian noise (θ + σ·ε_n), evaluates a reward R_n for each of N perturbed models, and updates θ with a weighted sum of the noise vectors (θ ← θ + α Σ_n R_n·ε_n, using normalized rewards).

The engineering innovations that make this feasible at scale include storing only random seeds rather than full noise or model copies, in-place layer-by-layer perturbation to cut peak GPU memory, massive embarrassingly parallel forward-pass evaluations, and within-generation z-score reward normalization. With surprisingly small populations (N = 30), the method is highly sample-efficient (using <20% of the data RL requires), avoids the memory cost of backpropagation, and scales across the Qwen and LLaMA families.

Empirically, ES outperformed PPO and GRPO on tasks like the Countdown reasoning benchmark (e.g., 60.5% vs. 32.5% on Qwen-2.5-3B) and produced a superior reward-vs-KL-divergence Pareto front for conciseness tuning without any explicit KL penalty. ES also showed far lower run-to-run variance in both reward and KL (~15.5× more stable) and resisted the reward-hacking failure modes common in RL.

The paper argues that parameter-space search effectively Gaussian-smooths jagged reward landscapes and optimizes distributions of solutions rather than single points, making ES more robust for sparse, long-horizon, or outcome-only rewards and opening a practical, parallelizable alternative to RL for LLM alignment and for novel unsupervised fine-tuning objectives.
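
To make the loop concrete, here is a minimal PyTorch sketch of one ES generation as described above: seed-indexed Gaussian perturbations applied in place, forward-only reward evaluation, within-generation z-score normalization, and a weighted-noise update. The `reward_fn`, `prompts`, hyperparameter values, and the 1/(N·σ) scaling on the update are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the simplified NES generation described above (assumed
# details: reward_fn, prompts, sigma, alpha, and the 1/(N*sigma) scaling).
import torch


def perturb_(model, seed, sigma, sign=+1):
    """Apply (sign=+1) or undo (sign=-1) the Gaussian perturbation for `seed`
    in place, layer by layer, so only one copy of the model sits in memory."""
    gen = torch.Generator(device="cpu").manual_seed(seed)
    for p in model.parameters():
        noise = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
        p.data.add_(sign * sigma * noise)


def es_step(model, reward_fn, prompts, N=30, sigma=0.01, alpha=0.001):
    """One ES generation: evaluate N seed-indexed perturbations with forward
    passes only, z-score the rewards, and apply
    theta <- theta + alpha/(N*sigma) * sum_n R_n * eps_n."""
    seeds = torch.randint(0, 2**31 - 1, (N,)).tolist()

    # Evaluate each perturbed model; in practice these N evaluations run
    # embarrassingly parallel across workers that share only the seeds.
    rewards = []
    for seed in seeds:
        perturb_(model, seed, sigma, sign=+1)   # theta + sigma * eps_n
        with torch.no_grad():
            rewards.append(reward_fn(model, prompts))
        perturb_(model, seed, sigma, sign=-1)   # restore theta

    # Within-generation z-score normalization of rewards.
    r = torch.tensor(rewards, dtype=torch.float32)
    r = (r - r.mean()) / (r.std() + 1e-8)

    # Reconstruct each eps_n from its seed (never stored in full) and
    # accumulate the weighted-noise update in place.
    for seed, r_n in zip(seeds, r.tolist()):
        gen = torch.Generator(device="cpu").manual_seed(seed)
        for p in model.parameters():
            noise = torch.randn(p.shape, generator=gen).to(device=p.device, dtype=p.dtype)
            p.data.add_((alpha / (N * sigma)) * r_n * noise)

    return rewards
```

Regenerating the noise from its seed twice per generation trades a little compute for memory: no noise vectors or model copies are ever stored, which is what lets the population scale across workers that exchange only seeds and scalar rewards.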