Defeating the Training-Inference Mismatch via FP16 (arxiv.org)

🤖 AI Summary
Researchers show that the brittle behavior often seen in RL fine-tuning of large language models stems not from algorithmic bugs but from floating-point precision choices. The widely used bfloat16 (BF16) format offers a large dynamic range (8 exponent bits) but keeps only 7 mantissa bits, and that coarse fractional precision introduces enough rounding error to break numerical consistency between the training policy and the inference policy. The paper demonstrates a surprisingly simple fix: switch uniformly to IEEE FP16 (5 exponent bits, 10 mantissa bits) during RL fine-tuning, which removes the training-inference mismatch without changing model architectures or learning algorithms. The change is trivial to implement in modern ML frameworks (a few lines of code) and is fully supported by existing hardware. Technically, the result highlights that BF16's lower fractional precision can alter logits, sampling probabilities, and gradient signals enough to destabilize policy optimization, while FP16's finer fractional precision, despite its smaller exponent range, restores consistency and yields empirically more stable optimization, faster convergence, and better final performance across tasks, algorithms, and frameworks. The takeaway for practitioners is immediate: reconsider precision trade-offs in RLHF and other RL fine-tuning workflows, where uniform FP16 can be a low-cost, high-impact lever for improving robustness and outcomes.
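To make the precision gap concrete, the following small PyTorch check (mine, not the paper's) round-trips the same values through BF16 and FP16 and compares the worst-case rounding error; for values in [0, 1), BF16's 7 mantissa bits lose roughly 8x more precision than FP16's 10.

```python
import torch

# Round-trip the same values through BF16 and FP16 and compare rounding error.
# BF16 keeps 8 exponent bits but only 7 mantissa bits; FP16 keeps 5 exponent
# bits and 10 mantissa bits, so it resolves fractions about 8x more finely.
x = torch.rand(10_000, dtype=torch.float64)

bf16_err = (x - x.to(torch.bfloat16).to(torch.float64)).abs().max().item()
fp16_err = (x - x.to(torch.float16).to(torch.float64)).abs().max().item()

print(f"max BF16 round-trip error: {bf16_err:.1e}")  # roughly 2e-3
print(f"max FP16 round-trip error: {fp16_err:.1e}")  # roughly 2e-4
```

As for the "few lines of code" claim, a plausible reading is that the autocast/compute dtype is switched from BF16 to FP16 in both the trainer and the rollout engine. The sketch below is a hypothetical, self-contained illustration, not the paper's implementation: the tiny linear model, random batch, and cross-entropy loss stand in for a real policy and RL objective, and dynamic loss scaling (GradScaler) is included because FP16's narrower exponent range can otherwise underflow gradients.

```python
import torch
import torch.nn.functional as F

# Hypothetical FP16 training step (a sketch, not the paper's code): run the
# forward/backward pass in FP16 via autocast and use dynamic loss scaling.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_fp16 = device == "cuda"

model = torch.nn.Linear(16, 4).to(device)          # stand-in for the policy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

inputs = torch.randn(8, 16, device=device)         # stand-in for a rollout batch
targets = torch.randint(0, 4, (8,), device=device)

with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_fp16):
    logits = model(inputs)                          # computed in FP16 under autocast
    loss = F.cross_entropy(logits, targets)         # stand-in for the RL objective

scaler.scale(loss).backward()    # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)           # unscales gradients; skips the step on inf/nan
scaler.update()
optimizer.zero_grad()
print(f"loss: {loss.item():.4f}")
```

Note that the paper's point is uniformity: the inference/rollout engine would have to serve the same weights in FP16 as well, so that sampling and the training-side log-probabilities see identical numerics.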