🤖 AI Summary
The lack of reproducibility in deep learning, amplified by floating-point non-associativity and divergent GPU code paths, remains a practical blocker for safety-critical and verifiable AI (e.g., zero-knowledge proofs that require bit-identical traces). The team tested across three NVIDIA GPUs (RTX 3090, RTX 4080, L4) using PyTorch and llama.cpp, toggling every determinism flag (fixed seeds, deterministic algorithms, disabling TF32 and cuDNN autotuning). Quantization (GGUF/int8) did not fix the issue because LayerNorm stayed in floating point and weights are often dequantized at runtime, reintroducing floating-point variance. Tracing into PTX and cuBLAS showed the primary drift came from GEMM kernels: architecture-specific kernel generation and tensor-core optimizations produced differing instruction orders and errors on the order of 1e-4 that accumulate across token steps.
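As a rough illustration of why framework flags alone were not enough, the sketch below (not from the article; the kernel names and test values are made up) shows floating-point non-associativity directly in CUDA: summing the same values in two different orders typically yields slightly different results, which is what architecture-specific GEMM tilings do at much larger scale.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reduction 1: a single thread accumulates in index order 0, 1, 2, ...
__global__ void sum_forward(const float* x, int n, float* out) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += x[i];
    *out = acc;
}

// Reduction 2: the same values accumulated in a strided (tiled) order,
// mimicking how a differently tiled kernel groups its partial sums.
__global__ void sum_strided(const float* x, int n, int stride, float* out) {
    float acc = 0.0f;
    for (int s = 0; s < stride; ++s)
        for (int i = s; i < n; i += stride) acc += x[i];
    *out = acc;
}

int main() {
    const int n = 1 << 20;
    float *x, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&out, 2 * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1e-4f * ((i % 7) - 3);  // small mixed-sign values

    sum_forward<<<1, 1>>>(x, n, &out[0]);
    sum_strided<<<1, 1>>>(x, n, 256, &out[1]);
    cudaDeviceSynchronize();

    // The two sums usually differ in the low-order bits; in a transformer,
    // differences like these feed forward through every layer and token step.
    printf("forward: %.9g  strided: %.9g  diff: %.3g\n",
           out[0], out[1], (double)(out[0] - out[1]));

    cudaFree(x);
    cudaFree(out);
    return 0;
}
```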
The breakthrough was rewriting the GEMM CUDA kernels to be deterministic: avoid tensor cores, force a fixed operation order, and compile for multiple architectures so that CUDA cores execute consistently everywhere. This change produced identical outputs across all tested machines and models (Llama and Mistral variants, both fp16 and quantized), demonstrating that low-level kernel control can eliminate cross-device divergence. There is a cost: prompt-processing throughput dropped substantially (in one example from ~302 tps to ~43 tps), while text-generation throughput remained at ~44–45 tps after optimizations. The broader implication is that reproducibility sometimes requires kernel-level fixes rather than just framework flags or quantization, and the team plans to collaborate with framework and hardware vendors to upstream deterministic kernels for safer, verifiable AI.
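The article does not include the kernels themselves; the following is a minimal sketch of the approach as described (plain CUDA cores, a fixed accumulation order, one thread per output element), not the team's actual code, and the function names `deterministic_sgemm` and `launch_deterministic_sgemm` are placeholders.

```cuda
#include <cuda_runtime.h>

// Naive deterministic GEMM: C = A * B, row-major, fp32.
// No tensor-core (WMMA/MMA) paths, no atomics, no split-K reduction:
// each thread owns one output element and accumulates its dot product
// in a fixed k = 0..K-1 order, so every architecture performs the same
// additions in the same sequence.
__global__ void deterministic_sgemm(const float* __restrict__ A,
                                    const float* __restrict__ B,
                                    float* __restrict__ C,
                                    int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        // Explicit single-rounding FMA, so the compiler cannot contract
        // or reorder the arithmetic differently per target architecture.
        acc = fmaf(A[row * K + k], B[k * N + col], acc);
    }
    C[row * N + col] = acc;
}

void launch_deterministic_sgemm(const float* A, const float* B, float* C,
                                int M, int N, int K, cudaStream_t stream) {
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    deterministic_sgemm<<<grid, block, 0, stream>>>(A, B, C, M, N, K);
}
```

Building such a kernel with explicit targets, e.g. `nvcc -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 ...`, would cover the RTX 3090 (sm_86) and the RTX 4080/L4 (sm_89) tested here; the one-thread-per-output design also sidesteps cross-thread reductions, a common source of run-to-run and cross-device variation, which is roughly where the reported throughput cost comes from.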