Characterizing Realistic Workloads on a Commercial Compute-in-SRAM Device (arxiv.org)

🤖 AI Summary
The paper presents the first comprehensive performance and energy characterization of a commercial compute-in-SRAM device (the GSI APU) on realistic workloads, and introduces an analytical framework that models the performance trade-offs of general-purpose compute-in-SRAM architectures. Crucially, the authors find that exploiting the device's fine-grained parallelism hinges on careful data management, so they propose three concrete optimizations: communication-aware reduction mapping, coalesced DMA, and broadcast-friendly data layouts, which together reduce on-chip communication and improve off-chip bandwidth utilization.

All device components were profiled on real hardware, while shared off-chip memory bandwidth was modeled with simulated HBM to study scalability. Applied to retrieval-augmented generation (RAG) over large corpora (10–200 GB), these optimizations accelerate retrieval by 4.8×–6.6× over an optimized CPU baseline and cut end-to-end RAG latency by 1.1×–1.8×. The system matches an NVIDIA A6000 GPU's RAG performance at a dramatically lower energy cost (a 54.4×–117.9× reduction).

For the AI/ML community, this validates compute-in-SRAM as a viable, highly energy-efficient platform for memory-bound, data-intensive tasks, and provides actionable guidance, both an analytical model and code-level mappings, for optimizing future memory-centric compute hardware and workloads.
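To give intuition for why a communication-aware reduction mapping helps, here is a minimal back-of-the-envelope sketch. It is not the paper's code: it assumes a hypothetical device whose lanes are partitioned into groups, where moving a partial sum across a group boundary is the expensive operation, and it compares cross-group traffic for a naive all-to-one reduction versus a two-level (reduce locally, then across groups) mapping.

```python
# Hypothetical cost model (not from the paper): lanes are split into
# groups, and we count only cross-group transfers, assumed to dominate
# on-chip communication cost.

def cross_group_transfers_naive(n_lanes: int, group_size: int) -> int:
    # Every lane sends its partial straight to lane 0, so each lane
    # outside group 0 pays one cross-group transfer.
    return n_lanes - group_size

def cross_group_transfers_two_level(n_lanes: int, group_size: int) -> int:
    # First reduce within each group (no cross-group traffic), then one
    # representative per non-zero group sends its result to group 0.
    n_groups = n_lanes // group_size
    return n_groups - 1

if __name__ == "__main__":
    # Example: 1024 lanes in groups of 32.
    print(cross_group_transfers_naive(1024, 32))      # 992
    print(cross_group_transfers_two_level(1024, 32))  # 31
```

Under these assumptions, the two-level mapping cuts cross-group traffic from O(n_lanes) to O(n_groups), which is the general shape of the benefit a communication-aware reduction targets.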