Scaling Test Time Compute (arxiv.org)

🤖 AI Summary
Researchers propose treating LLMs as "improvement operators" and introduce Parallel‑Distill‑Refine (PDR), a family of inference methods that replaces long chains-of-thought (CoT) with iterative, controllable pipelines. PDR first generates diverse solution drafts in parallel, then distills them into a bounded textual workspace, and finally refines outputs conditioned on that workspace, repeating rounds as needed. Crucially, the degree of parallelism controls context length (and per-step compute/latency) independently of the total tokens generated, so models can explore many hypotheses without inflating the context window or answer latency the way long CoT does. The degenerate case with parallelism=1, Sequential Refinement (SR), iteratively improves a single candidate and already outperforms long CoT. Empirically, PDR and SR yield higher accuracy with lower latency than long CoT on verifiable math tasks. The authors also train an 8B model with reinforcement learning to align it to PDR-style inference; iterative pipelines, especially PDR, beat matched-budget single‑pass baselines (e.g., +11% on AIME 2024, +9% on AIME 2025). The work shows that model orchestration can shift the accuracy/compute Pareto frontier without larger context windows, suggesting practical deployment gains (flexible latency vs. quality tradeoffs), and motivates training models explicitly for iterative multi-draft workflows rather than monolithic CoT.
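
To make the round structure concrete, here is a minimal sketch of a PDR-style loop. It assumes only a generic `generate(prompt) -> str` LLM call; the function name, prompt wording, and default values of `parallelism` and `rounds` are illustrative assumptions, not the paper's implementation.

```python
# Minimal PDR sketch: parallel drafting -> distillation into a bounded
# workspace -> refinement, repeated for a fixed number of rounds.
# `generate` is any text-in/text-out LLM call (assumed, not from the paper).

def pdr(problem: str, generate, parallelism: int = 4, rounds: int = 2) -> str:
    """Run PDR-style inference. With parallelism=1 this reduces to
    Sequential Refinement (SR), which iteratively improves one candidate."""
    workspace = ""  # bounded textual summary carried between rounds
    answer = ""
    for _ in range(rounds):
        # 1) Parallel: draft several diverse candidate solutions. Each call sees
        #    only the problem plus the short workspace, so per-call context stays
        #    bounded regardless of how many total tokens the round spends.
        drafts = [
            generate(
                f"Problem:\n{problem}\n\nNotes from earlier rounds:\n{workspace}\n\n"
                "Propose a complete solution."
            )
            for _ in range(parallelism)
        ]
        # 2) Distill: compress the drafts into a short workspace (key ideas,
        #    intermediate results, disagreements) rather than concatenating them.
        joined = "\n\n---\n\n".join(drafts)
        workspace = generate(
            "Summarize the key ideas, intermediate results, and disagreements in "
            f"these candidate solutions, in a few hundred words:\n\n{joined}"
        )
        # 3) Refine: produce an improved answer conditioned on the distilled
        #    workspace instead of the full draft transcripts.
        answer = generate(
            f"Problem:\n{problem}\n\nDistilled notes:\n{workspace}\n\n"
            "Write the best final solution and state the final answer."
        )
    return answer
```

In this sketch, `parallelism` sets how many drafts each round produces (and hence per-call context and latency), while `rounds` sets sequential compute; tuning them independently is the accuracy/latency tradeoff the summary describes.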