🤖 AI Summary
Kimi k1.5 presents a compact RL formulation that treats model generations and critic feedback as intermediate “reasoning” steps inside a long-context chain of thought (CoT), removing the need for an explicit search tree. The training objective is a relative-entropy (KL)–regularized policy optimization: the optimal policy has a closed-form, Boltzmann-like solution proportional to the reference policy times exp(reward/temperature), and the practical gradient reduces to an advantage-style term (reward minus the mean reward over sampled responses) times the gradient of the log-probability, plus a regularizer that keeps the new policy close to the reference. For efficiency and stability the authors deliberately omit a learned value model, approximate the KL correction using stored off-policy trajectories, and add a length penalty (normalized between the minimum and maximum response lengths) to discourage runaway response growth.
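For concreteness, here is a minimal LaTeX sketch of the formulation as summarized above. The notation (π_θ for the policy, π_ref for the reference policy, r for the reward, τ for the regularization temperature, r̄ for the mean reward over k sampled responses) is my own, and the exact form of the regularization term is an assumption consistent with the summary, not a quote from the paper.

```latex
% Sketch of the KL-regularized objective, its closed-form optimum, and the
% practical gradient described in the summary above (notation assumed).
\begin{align}
  % KL-regularized policy optimization objective
  \max_\theta \; & \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[ r(x, y) \big]
      - \tau \, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
  % Closed-form, Boltzmann-like optimum
  \pi^{*}(y \mid x) \; & \propto \; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \tau \big) \\
  % Practical gradient: advantage-style term plus a penalty keeping \pi_\theta near the reference
  \nabla_\theta \mathcal{J} \; & \approx \; \frac{1}{k} \sum_{i=1}^{k}
      \Big[ \big( r(x, y_i) - \bar{r} \big)\, \nabla_\theta \log \pi_\theta(y_i \mid x)
      \; - \; \frac{\tau}{2}\, \nabla_\theta \Big( \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\mathrm{ref}}(y_i \mid x)} \Big)^{2} \Big]
\end{align}
```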
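The length penalty can likewise be illustrated with a short sketch. The linear normalization between the shortest and longest sampled responses and the ±0.5 output range are assumptions for illustration, not the paper's exact constants.

```python
# Illustrative length penalty, assuming a simple linear normalization of the
# response length between the min/max lengths observed for the same prompt.
def length_penalty(length: int, min_len: int, max_len: int) -> float:
    """Map a response length to a score in [-0.5, 0.5]; longer responses
    score lower, discouraging runaway response growth."""
    if max_len == min_len:
        return 0.0
    frac = (length - min_len) / (max_len - min_len)  # 0 for shortest, 1 for longest
    return 0.5 - frac  # shortest response gets +0.5, longest gets -0.5
```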
Operationally, Kimi k1.5 couples rollout workers, a replay buffer, reward models, and training workers, with support for partial rollouts: sequences that exceed the context length can be continued later or excluded, and repetitive, non-unique rollouts are terminated early with penalties. Sampling mixes curriculum learning (easy→hard) with prioritized replay in which problems are drawn with probability proportional to (1 − success rate), as sketched below. To resolve Megatron/vLLM weight-compatibility issues and GPU idling, both engines are deployed as sidecars in a single Kubernetes pod, sharing GPU memory and coordinating via a master process (backed by etcd) for rapid weight handoff and efficient rollout generation, enabling scalable, long-context RL training with reused trajectories.
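A hypothetical sketch of that sampling scheme, assuming a per-problem difficulty label and per-problem success statistics; the `Problem` class and function names are illustrative, not from the paper.

```python
import random
from dataclasses import dataclass

# Curriculum ordering (easy -> hard) plus prioritized replay: problems are
# re-sampled with weight proportional to (1 - success rate), so problems the
# policy still fails on are revisited more often.

@dataclass
class Problem:
    prompt: str
    difficulty: int            # curriculum stage (lower = easier); assumed label
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

def curriculum_order(problems: list[Problem]) -> list[Problem]:
    """Easy-to-hard ordering for the initial curriculum pass."""
    return sorted(problems, key=lambda p: p.difficulty)

def prioritized_sample(problems: list[Problem], k: int) -> list[Problem]:
    """Sample k problems with weights proportional to (1 - success rate)."""
    weights = [max(1.0 - p.success_rate, 1e-3) for p in problems]  # floor avoids zero weight
    return random.choices(problems, weights=weights, k=k)
```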