A String Library Beat OpenCV at Image Processing by 4x (ashvardanian.com)

🤖 AI Summary
Albumentations — the hugely popular image-augmentation library — replaced parts of OpenCV with StringZilla’s LUT (look-up table) kernels after benchmarks showed StringZilla delivering up to ~4x higher throughput on common CPUs. On server-grade Intel and consumer Apple silicon, StringZilla’s lookup and translate routines hit multi-GiB/s rates (e.g., stringzilla::lookup_inplace ~9.9 GiB/s on long lines vs. opencv.LUT ~2.16 GiB/s on Intel), while Python’s built-ins and NumPy trail at an order of magnitude lower. That raw speed translates to faster augmentation pipelines, lower energy per image, and the ability to scale batch processing or training augmentations without new hardware. The secret is SIMD-aware LUT algorithms: for x86 AVX‑512, StringZilla partitions the 256-byte LUT into four 64‑byte tables and uses VPERMB-style permute lookups plus masked blends (VPBLENDMB/VPTESTMB) to do parallel byte translations efficiently, combined with a head/body/tail strategy to avoid unaligned-store penalties. On ARM, a NEON variant uses a uint8x16x4_t abstraction to achieve similar parallelism within 128‑bit vectors. The code also falls back to scalar for tiny inputs and offers SVE/SVE2 and CUDA kernels in v4. The result is a portable, high-throughput LUT primitive that’s now speeding up image augmentation and even bioinformatics workloads (DNA byte mappings), showing how low-level SIMD tuning can outperform established libraries like OpenCV.
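
To make the four-table trick concrete, here is a minimal C++ sketch of that AVX-512 VBMI approach — not StringZilla's actual implementation, and the function name lut_translate_avx512 is made up for illustration. It splits the 256-entry table into four 64-byte quarters held in ZMM registers, indexes each with VPERMB (which only looks at the low 6 bits of every input byte), and uses VPTESTMB-derived masks with byte blends to pick the right quarter per byte, with a plain scalar loop for the tail:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a 256-entry byte LUT applied with AVX-512 VBMI.
// Requires AVX-512F, BW, and VBMI; tail bytes fall back to scalar code.
void lut_translate_avx512(uint8_t *data, size_t length, const uint8_t lut[256]) {
    __m512i lut0 = _mm512_loadu_si512(lut +   0);  // entries   0..63
    __m512i lut1 = _mm512_loadu_si512(lut +  64);  // entries  64..127
    __m512i lut2 = _mm512_loadu_si512(lut + 128);  // entries 128..191
    __m512i lut3 = _mm512_loadu_si512(lut + 192);  // entries 192..255

    size_t i = 0;
    for (; i + 64 <= length; i += 64) {
        __m512i bytes = _mm512_loadu_si512(data + i);

        // VPERMB uses only the low 6 bits of each index byte, so each permute
        // yields lut[k*64 + (b & 63)] for the k-th quarter of the table.
        __m512i r0 = _mm512_permutexvar_epi8(bytes, lut0);
        __m512i r1 = _mm512_permutexvar_epi8(bytes, lut1);
        __m512i r2 = _mm512_permutexvar_epi8(bytes, lut2);
        __m512i r3 = _mm512_permutexvar_epi8(bytes, lut3);

        // Bit 6 of each byte selects between adjacent quarters,
        // bit 7 selects between the lower and upper halves of the LUT.
        __mmask64 bit6 = _mm512_test_epi8_mask(bytes, _mm512_set1_epi8(0x40));
        __mmask64 bit7 = _mm512_test_epi8_mask(bytes, _mm512_set1_epi8((char)0x80));

        __m512i lo  = _mm512_mask_blend_epi8(bit6, r0, r1);  // inputs   0..127
        __m512i hi  = _mm512_mask_blend_epi8(bit6, r2, r3);  // inputs 128..255
        __m512i out = _mm512_mask_blend_epi8(bit7, lo, hi);

        _mm512_storeu_si512(data + i, out);
    }

    // Scalar fallback for the remaining < 64 bytes.
    for (; i < length; ++i) data[i] = lut[data[i]];
}

The blog post's head/body/tail handling (aligning the bulk stores and masking the edges) and the NEON/SVE/CUDA variants are omitted here; the sketch only illustrates why four VPERMB lookups plus three blends translate 64 bytes per iteration.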