🤖 AI Summary
Stanford researchers David Zhang and Alex Aiken introduce a family of high-performance, branch-free algorithms, called floating-point accumulation networks (FPANs), for extended-precision floating-point arithmetic (2×, 3×, 4× machine precision and beyond). The paper extends FPANs from addition to subtraction, multiplication, division, and square root, operating on floating-point expansions of 2–4 machine-precision terms to deliver effective quadruple, sextuple, or octuple precision on CPUs and GPUs without dynamic allocation or branching. Critically, the authors built an SMT-based automated rounding-error analysis with machine-checkable proofs, so each FPAN carries a formal correctness guarantee for all floating-point inputs (barring overflow and underflow); this in turn enables a systematic search that discovers algorithms conjectured to be optimal in exact FLOP count and circuit depth. In benchmarks, their implementations outperform state-of-the-art software multiprecision libraries, with overall speedups reported at roughly 11.7–69.3× (e.g., ~11.7× vs. QD and ~34–41× vs. CAMPARY and MPFR/FLINT variants).
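The paper's verified FPANs are not reproduced here, but the flavor of the approach can be conveyed with the classic branch-free building blocks that floating-point expansion arithmetic rests on. Below is a minimal sketch, in C++, of Knuth's TwoSum error-free transformation and a simplified ("sloppy") double-double addition; the type name dd and the function names are illustrative and are not taken from the paper or any particular library.

```cpp
// Minimal sketch (NOT the paper's FPANs): the classic branch-free
// error-free transformations that floating-point expansion arithmetic
// is built on. TwoSum splits a + b exactly into a rounded sum plus the
// rounding error; dd_add combines two double-double expansions.
#include <cstdio>

struct dd { double hi, lo; };  // unevaluated sum hi + lo

// Knuth's TwoSum: s = fl(a + b), e = exact rounding error, no branches.
static inline dd two_sum(double a, double b) {
    double s = a + b;
    double z = s - a;
    double e = (a - (s - z)) + (b - z);
    return {s, e};
}

// Simplified ("sloppy") double-double addition with a final renormalization.
// Shown for illustration only; the paper's FPANs are different, formally
// verified networks.
static inline dd dd_add(dd x, dd y) {
    dd s = two_sum(x.hi, y.hi);
    double lo = s.lo + x.lo + y.lo;
    return two_sum(s.hi, lo);
}

int main() {
    dd a{1.0, 1e-20}, b{1e-16, 0.0};
    dd c = dd_add(a, b);
    // The low word captures error that plain double precision would lose.
    std::printf("hi = %.17g, lo = %.17g\n", c.hi, c.lo);
}
```

Note that every operation is straight-line floating-point arithmetic with no comparisons, branches, or dynamic allocation, which is precisely the property that lets such kernels run unchanged across SIMD lanes and GPU threads.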
For the AI/ML and HPC communities, this makes reliable extended precision practical on modern data-parallel hardware, where previous methods were prohibitively slow or branch-heavy. That lowers the barrier to using higher precision to mitigate instability, non-reproducibility, and catastrophic rounding in large-scale scientific and ML workloads (climate, energy grids, physics-informed models, long-running deep-network training) while preserving SIMD/SIMT performance. The combination of branch-free, data-parallel-friendly kernels and formal verification is especially valuable for production ML systems and scientific computing, where both speed and guaranteed numerical behavior matter.
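As an illustration of why branchlessness matters for data-parallel hardware, the following hedged sketch uses the well-known FMA-based TwoProd error-free transformation to recover the exact low-order error of a product and applies it elementwise over arrays. The names two_prod and dd_mul_arrays are hypothetical; this is generic double-double-style code, not the paper's implementation.

```cpp
// Generic double-double-style sketch (not the paper's code): with FMA, the
// exact product error is recovered in two instructions and the loop body has
// no data-dependent branches, so it vectorizes or ports directly to a GPU.
#include <cmath>
#include <cstddef>

struct dd { double hi, lo; };

// TwoProd via FMA: hi = fl(a*b), lo = exact error, branch-free.
static inline dd two_prod(double a, double b) {
    double hi = a * b;
    double lo = std::fma(a, b, -hi);
    return {hi, lo};
}

// Elementwise extended-precision product of two arrays; every lane performs
// the same straight-line work, which keeps SIMD/SIMT utilization high.
void dd_mul_arrays(const double* a, const double* b, dd* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = two_prod(a[i], b[i]);
}
```

Because every iteration executes identical straight-line instructions, a compiler can auto-vectorize the loop and the same body maps naturally onto SIMT threads, which is the performance property the summary above highlights.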