🤖 AI Summary
Radical Numerics announced RND1-Base, the largest open-source diffusion language model (DLM) to date: an experimental 30B-parameter sparse Mixture-of-Experts (MoE) with 3B active parameters, converted from a pretrained autoregressive (AR) checkpoint (Qwen3-30B-A3B) and continually pretrained on ~500B tokens. They release the model, training recipe, inference code, and samples, and show RND1 outperforms prior open diffusion baselines (Dream-7B, LLaDA-8B) on benchmarks across reasoning (MMLU, ARC-C, RACE, BBH), STEM (GSM8K), and code (MBPP). The project demonstrates that A2D (AR-to-diffusion) conversion can scale beyond 8B parameters while preserving the benefits of AR pretraining and enabling the inherent parallel-generation advantage of diffusion models.
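To make the "parallel-generation advantage" concrete, here is a minimal, hypothetical sketch (not Radical Numerics' released inference code) of how a masked diffusion LM can commit several tokens per step by iteratively unmasking the most confident positions, rather than decoding one token at a time as an AR model does. The mask-token id, step counts, and the `RandomScorer` stand-in model are illustrative assumptions.

```python
# Hypothetical sketch: parallel decoding with a masked diffusion LM.
# `model` is any network that scores every position of a partially masked
# sequence with bidirectional attention and returns per-position logits.
import torch

MASK_ID = 0  # placeholder mask-token id (assumption)

@torch.no_grad()
def parallel_unmask_decode(model, prompt_ids, gen_len=32, steps=8):
    # Start the generation region fully masked; the prompt stays fixed.
    seq = torch.cat([prompt_ids, torch.full((gen_len,), MASK_ID)])
    still_masked = torch.zeros_like(seq, dtype=torch.bool)
    still_masked[len(prompt_ids):] = True

    per_step = max(1, gen_len // steps)            # tokens committed each step
    for _ in range(steps):
        logits = model(seq.unsqueeze(0)).squeeze(0)    # [seq_len, vocab]
        conf, pred = logits.softmax(-1).max(-1)        # per-position confidence
        conf[~still_masked] = -1.0                     # only rank masked slots
        k = min(per_step, int(still_masked.sum()))
        if k == 0:
            break
        idx = conf.topk(k).indices                     # most confident positions
        seq[idx] = pred[idx]                           # commit k tokens in parallel
        still_masked[idx] = False
    return seq

# Toy stand-in "model" so the sketch runs end to end.
class RandomScorer(torch.nn.Module):
    def __init__(self, vocab=32000):
        super().__init__()
        self.vocab = vocab
    def forward(self, ids):
        return torch.randn(ids.shape[0], ids.shape[1], self.vocab)

print(parallel_unmask_decode(RandomScorer(), torch.tensor([5, 17, 99])))
```

An AR decoder would need one forward pass per generated token; here each forward pass fills several positions, which is the structural advantage the summary refers to.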
Technically, Radical Numerics emphasizes simplicity and stability: Simple Continual Pretraining (SCP) converts an AR model by swapping the causal mask for a bidirectional mask and continuing training under a masked diffusion objective with learning-rate warmup, eschewing more complex annealing or grafting schemes. To avoid catastrophic forgetting, they use layer-specific learning rates (higher for attention, lower for MLPs and embeddings), and they empirically measure the critical batch size via branched training, finding that diffusion pretraining benefits from very large effective batch sizes, with loss continuing to improve up to the largest batch sizes they measured. The work shows that principled AR pretraining practices combined with modest conversion techniques can yield scalable, high-performance DLMs, and it provides an open foundation for further experimentation and model customization.
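The following is a hedged sketch of the two conversion ideas described above, not the released recipe: (1) SCP-style training, which keeps the AR transformer's weights, drops the causal mask (bidirectional attention), and optimizes a masked-diffusion denoising loss; and (2) layer-specific learning rates, higher for attention than for MLPs/embeddings. The mask-token id, the learning-rate values, the `TinyBidirectionalLM` demo model, and helper names like `masked_diffusion_loss` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder [MASK] token id (assumption)

def masked_diffusion_loss(model, input_ids):
    """One step of a masked (absorbing-state) diffusion objective."""
    b, t = input_ids.shape
    # Sample a masking rate per sequence, then corrupt tokens independently.
    rate = torch.rand(b, 1, device=input_ids.device)
    corrupt = torch.rand(b, t, device=input_ids.device) < rate
    corrupt[torch.arange(b), torch.randint(t, (b,))] = True  # ensure >=1 masked slot
    noisy = torch.where(corrupt, torch.full_like(input_ids, MASK_ID), input_ids)

    # Bidirectional attention: no causal mask is passed to the model.
    logits = model(noisy)                                     # [b, t, vocab]

    # Cross-entropy only on corrupted positions (predict the clean token).
    return F.cross_entropy(logits[corrupt], input_ids[corrupt])

def layerwise_param_groups(model, lr_attn=3e-4, lr_other=1e-4):
    """Higher LR for attention weights, lower for MLPs/embeddings (illustrative values)."""
    attn, other = [], []
    for name, p in model.named_parameters():
        (attn if "attn" in name else other).append(p)
    return [{"params": attn, "lr": lr_attn},
            {"params": other, "lr": lr_other}]

# Tiny demo model so the sketch executes (stands in for the converted AR MoE).
class TinyBidirectionalLM(torch.nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.attn_proj = torch.nn.Linear(dim, dim)   # "attention" params get the higher LR
        self.mlp = torch.nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.mlp(torch.relu(self.attn_proj(self.embed(ids))))

model = TinyBidirectionalLM()
opt = torch.optim.AdamW(layerwise_param_groups(model))
loss = masked_diffusion_loss(model, torch.randint(1, 100, (2, 16)))
loss.backward()
opt.step()
```

In a full run, the optimizer above would sit under a warmup schedule and very large effective batch sizes, per the batch-size findings summarized in this section.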