🤖 AI Summary
Researchers released Ring-1T, an open-source "thinking" model with 1 trillion parameters (a sparse mixture-of-experts architecture that activates ~50 billion parameters per token), together with a suite of reinforcement-learning innovations that enable stable, efficient training at this scale. The team identifies three core scaling pain points: train/inference misalignment, rollout inefficiency under token budgets, and RL-system bottlenecks. To address them, it introduces IcePop (token-level discrepancy masking and clipping that stabilizes RL against training/inference mismatch), C3PO++ (dynamic partitioning of long rollouts to improve time and compute efficiency), and ASystem (a high-performance RL framework that removes system-level bottlenecks). Together these methods allow Ring-1T to be trained end-to-end with policy optimization at trillion-parameter scale.
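The mechanisms behind IcePop and C3PO++ as described above are simple enough to sketch. Below is a minimal PyTorch illustration of token-level discrepancy masking in the spirit of IcePop, assuming a plain REINFORCE-style objective; the function name, the `eps` threshold, and the exact masking rule are illustrative stand-ins, not the paper's implementation (which combines masking with clipping of divergent tokens):

```python
import torch

def discrepancy_masked_loss(train_logprobs: torch.Tensor,
                            rollout_logprobs: torch.Tensor,
                            advantages: torch.Tensor,
                            eps: float = 0.1) -> torch.Tensor:
    """Policy-gradient loss that masks tokens where the training and
    inference engines disagree.

    train_logprobs:   log-probs of sampled tokens under the training engine
    rollout_logprobs: log-probs of the same tokens recorded by the inference
                      engine during rollout
    advantages:       per-token advantage estimates
    eps:              tolerated train/inference probability-ratio gap
                      (hypothetical default, not the paper's value)
    """
    # Ratio between the two engines' probabilities for each generated token.
    ratio = torch.exp(train_logprobs - rollout_logprobs)

    # Drop tokens whose ratio drifts outside the tolerance band; gradients
    # from such mismatched tokens are what destabilizes large-scale MoE RL.
    mask = ((ratio > 1.0 - eps) & (ratio < 1.0 + eps)).float()

    # REINFORCE-style objective over the surviving tokens only.
    loss = -(mask * advantages * train_logprobs).sum() / mask.sum().clamp(min=1.0)
    return loss
```

And a schematic of C3PO++-style rollout partitioning: generations that hit a per-iteration token budget are suspended and resumed in a later iteration rather than stalling the whole batch. Here `generate_step` is a hypothetical single-token decode hook supplied by the caller; the paper's actual scheduler is necessarily more elaborate:

```python
from dataclasses import dataclass, field

@dataclass
class PartialRollout:
    prompt: str
    tokens: list = field(default_factory=list)
    done: bool = False

def generate_with_budget(rollouts, generate_step, token_budget: int):
    """One iteration's generation phase under a shared token budget.

    Returns (finished, carried_over): completed rollouts go to training,
    while suspended ones resume from their saved state next iteration.
    """
    spent = 0
    finished, carried_over = [], []
    for r in rollouts:
        # Decode until this rollout finishes or the budget is exhausted.
        while not r.done and spent < token_budget:
            token, r.done = generate_step(r)  # hypothetical decode step
            r.tokens.append(token)
            spent += 1
        (finished if r.done else carried_over).append(r)
    return finished, carried_over
```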
The model achieves state-of-the-art results on open benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, a 2088 rating on CodeForces, and 55.94 on ARC-AGI-v1, plus a silver-medal-level result on IMO-2025, highlighting strong multi-step reasoning and coding/problem-solving ability. By releasing the full 1T MoE model and the associated RL tooling, the work lowers the barrier for community research into trillion-parameter reasoning systems and establishes practical techniques (token-level RL fixes, rollout partitioning, and systems engineering) that other labs can adopt when scaling sparse, RL-trained models.