GPT-OSS Reinforcement Learning (docs.unsloth.ai)

🤖 AI Summary
Unsloth announced a new inference and RL stack that lets you train OpenAI's gpt-oss with GRPO on consumer GPUs, including free 15GB Colab T4s, by rewriting Transformers inference to be RL-compatible and highly efficient. Compared to prior implementations, Unsloth claims roughly 3x faster int8/4-bit inference (about 4x faster in some 4-bit tests), around 21 tokens/s for gpt-oss (about 30 tokens/s in BF16), 50% lower VRAM use, and support for up to 8x longer context. Practically, that means gpt-oss-20B RL training fits in 15GB of VRAM and gpt-oss-120B fits on roughly 80GB, democratizing RL fine-tuning of a frontier OpenAI architecture outside large lab clusters.

The team implemented Unsloth Flex Attention, custom kernels, and torch.compile-friendly optimizations to handle KV-cache prefill, per-sequence padding, sliding windows, and dynamic causal masks without the quadratic memory blowup of naïve attention. This was necessary because vLLM and FlashAttention currently have limitations for gpt-oss RL: vLLM lacks BF16 and LoRA support, while FlashAttention showed layer-wise numerical divergence and does not support attention sinks in backprop. They also warn that FlashAttention 3's defaults are unsafe for gpt-oss, since its backward pass does not handle attention sinks and so corrupts the training loss. Additional notebook material demonstrates countermeasures for reward hacking in code-generation RL tasks. Overall, Unsloth fills a practical gap, enabling scalable, low-VRAM RL for gpt-oss with careful attention/masking engineering and quantization/BF16 trade-offs.
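To give a rough sense of what the training workflow looks like, here is a minimal GRPO sketch combining Unsloth's FastLanguageModel loader with TRL's GRPOTrainer. The checkpoint name, LoRA settings, tiny prompt dataset, and the toy length-based reward are illustrative assumptions, not the configuration from Unsloth's gpt-oss notebooks.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

# Assumption: a 4-bit gpt-oss checkpoint published by Unsloth; the exact name
# and context length should be taken from Unsloth's own documentation.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=4096,
    load_in_4bit=True,
)

# LoRA adapters so the RL step only updates a small set of weights.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Toy prompt dataset with the "prompt" column GRPOTrainer expects.
dataset = Dataset.from_list(
    [{"prompt": "Write a Python function that reverses a string."}] * 64
)

def reward_shorter(completions, **kwargs):
    # Toy reward: prefer shorter completions. A real code-generation reward
    # would execute/verify the code and guard against reward hacking.
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[reward_shorter],
    train_dataset=dataset,
    args=GRPOConfig(
        per_device_train_batch_size=4,
        num_generations=4,           # completions sampled per prompt for GRPO
        max_prompt_length=256,
        max_completion_length=512,
        learning_rate=5e-6,
        max_steps=100,
    ),
)
trainer.train()
```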
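On the attention side, the summary's "sliding windows and dynamic causal masks" can be illustrated with PyTorch's built-in flex_attention API. This is not Unsloth's actual Flex Attention implementation (which also handles attention sinks, prefill, and padding); it is a minimal sketch of how a sliding-window causal mask is expressed as a mask function so the dense S x S mask never has to be materialized. The window size and tensor shapes are arbitrary.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

SLIDING_WINDOW = 128  # hypothetical window size

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Causal constraint: a query attends only to itself and earlier positions.
    causal = q_idx >= kv_idx
    # Sliding-window constraint: only the most recent SLIDING_WINDOW positions.
    in_window = (q_idx - kv_idx) <= SLIDING_WINDOW
    return causal & in_window

B, H, S, D = 1, 8, 1024, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)

# create_block_mask precomputes which key/value blocks can be skipped entirely,
# so memory and compute scale with the window rather than the full sequence.
block_mask = create_block_mask(
    sliding_window_causal, B=None, H=None, Q_LEN=S, KV_LEN=S
)
out = flex_attention(q, k, v, block_mask=block_mask)
```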