Stable LLM Inference (www.gojiberries.io)

🤖 AI Summary
LLMs are not deterministic, even at temperature = 0, so identical prompts can yield different outputs, undermining reproducibility, debugging, and production stability. The article proposes interventions at three layers: generation, learning, and infrastructure.

At the generation layer: constrain decoding with schemas or grammars to shrink the valid-output space, prefer retrieval and caching for repeated queries, canonicalize inputs to reduce entropy (menus, normalization, paraphrase detection), and optimize prompts for bitwise or field-level consistency. These measures turn volatile free-form sampling into deterministic lookups or tightly constrained searches.

At the learning layer: make the model's conditional distribution more decisive by selecting a canonical target y* and enlarging the per-token logit margin so the canonical token beats the runner-up by at least γ, using a margin loss L_margin = Σ_t max(0, γ − (ℓ_{y*_t} − max_{k≠y*_t} ℓ_k)), and adding a ranking loss against semantically valid negatives, L_rank = log(1 + exp(s(x, ỹ) − s(x, y*))).

At the infrastructure layer: fix runtime nondeterminism by pinning hardware and software (GPU, drivers, CUDA/cuBLAS/cuDNN, tokenizer and quantization configs), enforcing deterministic flags (e.g., torch.use_deterministic_algorithms(True), CUBLAS_WORKSPACE_CONFIG=:4096:8), and eliminating batch-induced floating-point reduction variability: different batching or chunking changes the order of floating-point reductions, slightly perturbs logits, and can flip argmaxes.

Together these strategies make LLM inference far more stable for research and production systems. Illustrative sketches of several of these ideas follow.
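A minimal sketch of the caching-plus-canonicalization idea, assuming an exact-match cache keyed on a normalized prompt; `generate` stands in for whatever LLM call the system makes, and the normalization rules here are placeholders rather than the article's.

```python
# Turn repeated queries into deterministic lookups: canonicalize the prompt,
# then serve a cached response whenever a canonically equal prompt recurs.
import hashlib
import unicodedata

_cache: dict[str, str] = {}

def canonicalize(prompt: str) -> str:
    """Reduce input entropy: normalize unicode, case, and whitespace."""
    text = unicodedata.normalize("NFKC", prompt)
    return " ".join(text.lower().split())

def cache_key(prompt: str) -> str:
    return hashlib.sha256(canonicalize(prompt).encode("utf-8")).hexdigest()

def answer(prompt: str, generate) -> str:
    """Return a cached answer for canonically equal prompts; otherwise generate once and store."""
    key = cache_key(prompt)
    if key not in _cache:
        _cache[key] = generate(prompt)  # `generate` is any LLM call (assumption)
    return _cache[key]
```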
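A toy illustration of constrained decoding as logit masking: at each step, only tokens allowed by the schema, grammar, or menu can win the argmax. The vocabulary size and token ids below are made up; real grammar-constrained decoders recompute the allowed set at every step.

```python
# Hypothetical sketch of schema/menu-constrained decoding: instead of sampling freely,
# mask every logit outside the allowed set and take the argmax among what remains.
import torch

def constrained_argmax(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    """Pick the highest-logit token among an explicitly allowed set."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    return int(torch.argmax(logits + mask).item())

# Example: a classifier-style prompt whose only valid outputs are "yes"/"no".
logits = torch.randn(32_000)   # stand-in for one decoding step's logits
allowed = [5081, 1217]         # hypothetical token ids for "yes" and "no"
print(constrained_argmax(logits, allowed))
```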
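The two training objectives can be written down directly. This is a sketch, assuming per-position logits of shape (T, V), a canonical target sequence y* of shape (T,), and scalar sequence scores for the ranking term, rather than the article's implementation.

```python
# Token-level margin loss and pairwise ranking loss from the summary.
import torch
import torch.nn.functional as F

def margin_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """L_margin = sum_t max(0, gamma - (logit[y*_t] - max_{k != y*_t} logit[k]))."""
    tgt_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)   # (T,) logit of canonical token
    masked = logits.clone()
    masked.scatter_(1, target.unsqueeze(1), float("-inf"))         # hide the target column
    runner_up = masked.max(dim=1).values                           # best competing logit per step
    return F.relu(gamma - (tgt_logit - runner_up)).sum()

def ranking_loss(score_canonical: torch.Tensor, score_negative: torch.Tensor) -> torch.Tensor:
    """L_rank = log(1 + exp(s(x, y~) - s(x, y*))), i.e. softplus of the score gap."""
    return F.softplus(score_negative - score_canonical)
```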
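For the runtime side, the flags named in the summary can be bundled into one setup routine. A sketch assuming a PyTorch/CUDA stack; the seed value and the decision to also seed Python and NumPy are assumptions.

```python
# Pin runtime determinism: set the cuBLAS workspace variable before any CUDA work,
# seed every RNG, and request deterministic kernels.
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must be set before cuBLAS initializes

import random
import numpy as np
import torch

def pin_determinism(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)   # raise on ops without a deterministic kernel
    torch.backends.cudnn.benchmark = False     # disable autotuning that can vary across runs

pin_determinism(0)
```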
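Finally, the batching point comes down to floating-point addition not being associative. The NumPy toy below is a stand-in for what happens inside GPU reduction kernels when batch or chunk boundaries change; the chunk count and array size are arbitrary.

```python
# Summing the same float32 values in a different order gives slightly different results,
# which is how batching/chunking perturbs logits and can flip argmaxes.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

whole   = np.sum(x)                                     # one reduction order
chunked = sum(np.sum(c) for c in np.array_split(x, 7))  # another "batching" of the same data

print(whole, chunked, whole == chunked)  # typically differs in the last bits
```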