Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan (blog.vllm.ai)

🤖 AI Summary
Researchers demonstrated an open-source, bitwise-consistent on-policy RL run that pairs TorchTitan as the trainer with vLLM as the inference engine, fine-tuning Qwen3 1.7B on a GSM8K correctness-reward task. Building on vLLM's batch-invariant inference work, they audited every forward-kernel call to ensure bitwise equivalence across frameworks, imported vLLM's fused forward ops (e.g., SiLU MLPs, RMSNorm with residuals), and implemented custom backward passes for them in PyTorch.

Empirically, runs where the trainer and sampler used different kernels showed degraded reward over 100 steps, while bitwise-exact training (the batch_inv_ON run, where kl_div = 0.0) produced faster convergence and higher total reward, underscoring how tiny numerical mismatches can destabilize RL. Technically, the team wrapped generation and weight updates in a VLLMRolloutEngine and ran synchronous on-policy alternation between trainer and generator on a single host to prove exactness.

Current limitations: the setup is 2.4× slower than non-bitwise runs, relies on dual model code paths (trainer vs. inference), and uses eager-mode TorchTitan while vLLM leverages torch.compile. Next steps include a unified model definition, compilation support to reconcile torch.compile across both stacks, wider model/operator coverage, and kernel tuning to recover performance. This work highlights that cross-framework numerical determinism is crucial for RL stability, reproducibility, and reliable fine-tuning at scale.
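To make the control flow concrete, below is a minimal Python sketch of the synchronous on-policy loop described above: a stand-in rollout engine that receives fresh trainer weights before every generation step, and a KL check that reads exactly 0.0 when trainer and sampler forward passes are bitwise identical. The method names (generate, update_weights), the log-prob bookkeeping, and the toy policy-gradient loss are illustrative assumptions; the post only states that generation and weight updates are wrapped in a VLLMRolloutEngine and that kl_div = 0.0 under bitwise-exact training.

```python
import torch
import torch.nn as nn


class RolloutEngine:
    """Stand-in for the VLLMRolloutEngine described in the post: it owns the
    sampler (vLLM in the real setup) and exposes generation plus weight
    refresh. Method names here are illustrative, not the actual API."""

    def __init__(self, sampler_model: nn.Module):
        self.sampler_model = sampler_model

    @torch.no_grad()
    def generate(self, prompts: list[torch.Tensor]) -> list[torch.Tensor]:
        # The real engine samples completions with batch-invariant kernels;
        # here we just return the sampler's log-probabilities per prompt.
        return [torch.log_softmax(self.sampler_model(p), dim=-1) for p in prompts]

    def update_weights(self, state_dict) -> None:
        # Synchronous on-policy alternation: the sampler is refreshed with
        # the trainer's weights before every generation step.
        self.sampler_model.load_state_dict(state_dict)


def on_policy_step(trainer: nn.Module, engine: RolloutEngine,
                   prompts: list[torch.Tensor], rewards: torch.Tensor,
                   optimizer: torch.optim.Optimizer) -> float:
    # 1) Push the latest trainer weights into the sampler.
    engine.update_weights(trainer.state_dict())

    # 2) Generate rollouts and record the sampler's log-probs.
    sampler_logp = engine.generate(prompts)

    # 3) Recompute log-probs with the trainer's kernels. With bitwise-exact
    #    forward passes the two sets match and the KL term is exactly 0.0.
    trainer_logp = [torch.log_softmax(trainer(p), dim=-1) for p in prompts]
    kl = sum(torch.sum(t.exp() * (t - s)).item()
             for t, s in zip(trainer_logp, sampler_logp))
    assert kl == 0.0, "trainer/sampler forward passes diverged"

    # 4) Toy reward-weighted policy-gradient update (stand-in objective).
    loss = -(rewards * torch.stack([t.sum() for t in trainer_logp])).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the actual run the sampler side is a vLLM engine using the imported fused forward ops while the trainer side (TorchTitan) uses the matching custom backward passes; the sketch only illustrates the trainer/generator alternation and the kl_div = 0.0 invariant that bitwise consistency is meant to guarantee.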