Defeating Nondeterminism in LLM Inference (thinkingmachines.ai)

🤖 AI Summary
Reproducibility in large language model (LLM) inference is surprisingly hard to achieve, even under nominally deterministic settings such as greedy sampling (temperature set to zero). The common explanation blames floating-point non-associativity combined with concurrency on GPUs, where the order in which parallel threads finish can vary between runs. The authors argue the reality is more nuanced: most GPU kernels used in an LLM's forward pass avoid nondeterministic atomic operations, so repeating the same matrix computations on the same inputs yields bitwise-identical results. The forward pass itself is run-to-run deterministic.

The deeper source of nondeterminism is how batched requests are processed, combined with the lack of "batch invariance" in common kernels. Although each forward pass is deterministic for a fixed input batch, the output for an individual request can change depending on which other requests happen to be batched alongside it, because a kernel may organize its floating-point reductions differently for different batch shapes. Since the batch composition depends on server load, an individual user effectively sees nondeterministic results even though no single computation is nondeterministic.

This insight shifts attention from hardware-level numerical uncertainty to software-level batch handling: truly reproducible LLM inference requires controlling what else runs in the batch or, better, making the kernels batch-invariant so that a request's result does not depend on its batchmates. In other words, the nondeterminism is not simply a floating-point arithmetic quirk but a consequence of how batch composition influences the computation inside inference servers. Recognizing and addressing this can help the AI/ML community build inference pipelines and kernels that consistently deliver reproducible outputs, advancing scientific rigor and reliability in LLM research and deployment.
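As a concrete illustration, here is a minimal sketch of the batch-invariance issue, assuming PyTorch and a CUDA-capable GPU (the tensor shapes and dtype are illustrative choices, not taken from the article's exact setup). Each line is run-to-run deterministic on its own, yet the result computed for the same row may differ depending on the batch it was computed in.

```python
import torch

torch.manual_seed(0)

# Two bfloat16 matrices on the GPU; shapes chosen only for illustration.
A = torch.randn(2048, 4096, device="cuda", dtype=torch.bfloat16)
B = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

# Row 0 computed as part of a 2048-row batch.
out_full = A @ B
# Row 0 computed on its own (batch of one).
out_single = A[:1] @ B

# Repeating either matmul gives bitwise-identical output every time,
# but the two results for row 0 need not match each other, because the
# kernel may use a different reduction strategy for different batch shapes.
print((out_full[0] - out_single[0]).abs().max())
```

If the printed difference is nonzero, it is not because the GPU is nondeterministic (rerunning either line reproduces the same bits); it is because the matmul kernel does not guarantee batch invariance, which is exactly the property the authors argue reproducible inference requires.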