🤖 AI Summary
StringZilla v4 (CUDA) is out as a pip-installable package, delivering “500+ GigaCUPS” of edit-distance throughput on GPUs and a suite of CPU/GPU-accelerated string kernels for information retrieval, databases, data lakes, and bioinformatics. The release adds massively parallel implementations of Levenshtein, Needleman–Wunsch, and Smith–Waterman (with Gotoh affine gaps), GPU-accelerated MinHashing, new intersection and sorting kernels for DBMS workloads, and AES-based hashing and PRNGs, all with first-party bindings for Python, Rust, JavaScript, and Swift. On an NVIDIA H100, the Levenshtein kernel reached 624,730 MCUPS (~625 GigaCUPS), dwarfing many CPU libraries and even outperforming RAPIDS’ nvtext in the author’s benchmarks.
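The CUPS metric is worth unpacking: it counts dynamic-programming cell updates per second, since one pairwise alignment of strings of lengths `m` and `n` fills an `m × n` DP matrix. A tiny helper (name and example lengths are mine, not from the benchmark) shows how the headline figure translates into per-pair latency:

```python
def gigacups(len_a: int, len_b: int, seconds: float) -> float:
    # One pairwise alignment updates len_a * len_b DP cells;
    # CUPS = cells / seconds, reported here in GigaCUPS (1e9 CUPS).
    return (len_a * len_b) / seconds / 1e9

# At roughly 625 GigaCUPS, aligning two 10,000-character strings
# (1e8 cell updates) takes about 1e8 / 6.25e11 s, i.e. ~160 microseconds.
pair_time = (10_000 * 10_000) / 625e9
```

This is also why GPU throughput numbers are usually quoted on large batches of pairs: a single short pair cannot saturate the device.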
Technically, StringZilla achieves this by reordering the dynamic-programming traversal along anti-diagonals (storing only three diagonals at a time, so every cell on a diagonal can be computed in parallel), using GPU SIMD primitives (DP4A/DPX instructions), and exploiting port-parallel AES+SIMD recipes for high-throughput hashing and PRNGs. Bioinformatics use cases get affine-gap support (Gotoh’s three-matrix formulation) and protein scoring via substitution matrices; the current CUDA code keeps the substitution matrix in constant memory, which the author notes as a bottleneck to fix. The project targets practical, deployable baselines rather than chasing every SOTA micro-optimization: cross-architecture dynamic dispatch (AVX-512, AVX2, NEON, SVE2), non-temporal stores for IO-heavy throughput, and easy integration make it a useful tool for large-scale similarity search and sequence analysis, with ROCm/AMD acceleration still on the to-do list.
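The anti-diagonal trick is the core idea: in the classic row-by-row Levenshtein recurrence each cell depends on its left, upper, and upper-left neighbors, but every cell on anti-diagonal `k = i + j` depends only on diagonals `k-1` and `k-2`, so a whole diagonal can be computed at once. A minimal single-threaded Python sketch of that reordering (not StringZilla’s CUDA code; function name is mine):

```python
def levenshtein_antidiagonal(a: str, b: str) -> int:
    # Cell (i, j) of the (len(a)+1) x (len(b)+1) DP matrix lies on
    # anti-diagonal k = i + j.  Each diagonal reads only the previous
    # two, so only three diagonals need to be kept in memory, and all
    # cells of one diagonal are independent (a GPU computes them in
    # parallel; this sketch just loops over them).
    n, m = len(a), len(b)
    prev2, prev = {}, {}          # diagonals k-2 and k-1, keyed by row i
    for k in range(n + m + 1):
        cur = {}
        for i in range(max(0, k - m), min(n, k) + 1):
            j = k - i
            if i == 0:
                cur[i] = j        # first row: j insertions
            elif j == 0:
                cur[i] = i        # first column: i deletions
            else:
                cur[i] = min(
                    prev2[i - 1] + (a[i - 1] != b[j - 1]),  # substitution
                    prev[i - 1] + 1,                        # deletion
                    prev[i] + 1,                            # insertion
                )
        prev2, prev = prev, cur
    return prev[n]                # diagonal n+m holds only cell (n, m)
```

Storing dense arrays instead of dicts and vectorizing the inner loop is what the DP4A/DPX primitives buy on the GPU.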