🤖 AI Summary
LLMs are not deterministic, even at temperature = 0, so identical prompts can yield different outputs, undermining reproducibility, debugging, and production stability. The article proposes three intervention layers. At the generation layer: constrain decoding with schemas/grammars to shrink the valid-output space, prefer retrieval/caching for repeated queries, canonicalize and reduce input entropy (menus, normalization, paraphrase detection), and optimize prompts for bitwise or field-level consistency. These measures replace free-form sampling with deterministic lookups or tightly constrained searches.
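A minimal sketch (not from the article) of the canonicalize-then-cache idea: normalize the prompt so trivially different phrasings map to the same key, then serve repeats from a cache so identical requests become deterministic lookups. The canonicalize() rules and the call_model stub are illustrative assumptions.

```python
import hashlib
import unicodedata

_cache: dict[str, str] = {}

def canonicalize(prompt: str) -> str:
    """Normalize Unicode, case, and whitespace to reduce input entropy."""
    text = unicodedata.normalize("NFKC", prompt)
    return " ".join(text.lower().split())

def cached_generate(prompt: str, call_model) -> str:
    """Return a cached answer for a canonicalized prompt; call the model only on a miss."""
    canonical = canonicalize(prompt)
    key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(canonical)  # call_model is a placeholder for the real inference call
    return _cache[key]
```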
At the learning and infrastructure layers: make the model's conditional distribution more decisive by selecting a canonical target y* and enforcing a token-level margin γ between the correct token and its closest competitor, using a margin loss L_margin = Σ_t max(0, γ − (ℓ_{y_t*} − max_{k≠y_t*} ℓ_k)), plus a ranking loss against semantically valid negatives, L_rank = log(1 + exp(s(x, ỹ) − s(x, y*))). Then fix runtime nondeterminism by pinning hardware and software (GPU, drivers, CUDA/cuBLAS/cuDNN, tokenizer and quantization configs), enforcing deterministic flags (e.g., torch.use_deterministic_algorithms(True), CUBLAS_WORKSPACE_CONFIG=:4096:8), and eliminating batch-induced floating-point reduction variability, since different batching or chunking changes reduction order, slightly perturbs logits, and can flip argmaxes. Together these strategies make LLM inference far more stable for research and production systems.
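A minimal PyTorch sketch of the two losses as written above. The default γ = 2.0 and the choice of sequence score s (e.g., summed log-probabilities of a full response) are assumptions for illustration, not values from the article.

```python
import torch

def margin_loss(logits: torch.Tensor, target_ids: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """L_margin = sum_t max(0, gamma - (logit of canonical token - best competing logit)).

    logits: [T, V] per-step token logits; target_ids: [T] canonical target y*.
    """
    target_logits = logits.gather(1, target_ids.unsqueeze(1)).squeeze(1)   # l_{y_t*}
    masked = logits.clone()
    masked.scatter_(1, target_ids.unsqueeze(1), float("-inf"))             # hide the target token
    best_other = masked.max(dim=1).values                                  # max_{k != y_t*} l_k
    return torch.clamp(gamma - (target_logits - best_other), min=0.0).sum()

def rank_loss(score_neg: torch.Tensor, score_pos: torch.Tensor) -> torch.Tensor:
    """L_rank = log(1 + exp(s(x, y~) - s(x, y*))), with sequence scores supplied by the caller."""
    return torch.nn.functional.softplus(score_neg - score_pos)
```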
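A sketch of the runtime-pinning step, assuming a PyTorch stack: it sets the deterministic flags named above and fixes a batch size so floating-point reduction order stays constant. The seed and batch size are placeholders; hardware, driver, and CUDA library versions still have to be pinned outside Python (e.g., in the container image).

```python
import os
import torch

# Must be set before the CUDA context is created for deterministic cuBLAS kernels.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)                          # placeholder seed
torch.use_deterministic_algorithms(True)      # raise an error on nondeterministic ops
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False        # autotuning can pick different kernels per run

# Batch-invariance: keep batch size and padding fixed so reduction order,
# and therefore the logits, do not change between runs.
BATCH_SIZE = 1
```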