🤖 AI Summary
Autocomp, an automated kernel optimizer, transformed AWS Neuron's conv1d_depthwise_default kernel for Trainium and achieved a 17.37× speedup (latency from 8.007 ms to 0.461 ms) for a representative shape (N=8, in/out channels=512, width=2048, filter_width=3). The work systematically explored NKI/Trainium-specific transformations (SBUF and PSUM usage, DMA vs. load, loop fusion and interchange, tiling) to reduce memory pressure and enable aggressive compiler fusion, showing how an automated tool can find non-obvious, hardware-aware optimizations that human authors might miss.
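For reference, a depthwise 1-D convolution at the stated shape filters each channel independently with its own K-tap filter, with no cross-channel mixing. The NumPy sketch below is only a functional model of what the kernel computes; the function name, data layout, and dtype are illustrative assumptions, not the actual NKI code:

```python
import numpy as np

# Shape from the summary: N=8 images, C=512 channels, W=2048 width, K=3 taps.
N, C, W, K = 8, 512, 2048, 3
x = np.random.rand(N, C, W).astype(np.float32)
w = np.random.rand(C, K).astype(np.float32)

def conv1d_depthwise_ref(x, w):
    """Functional model (assumed layout): each channel convolved with its own filter."""
    N, C, W = x.shape
    K = w.shape[1]
    out = np.zeros((N, C, W - K + 1), dtype=x.dtype)
    for k in range(K):
        # Tap k scales a shifted slice of the input; no mixing across channels.
        out += w[:, k][None, :, None] * x[:, :, k:k + W - K + 1]
    return out

ref = conv1d_depthwise_ref(x, w)
```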
Key technical moves included tiling channels into 128-channel groups (matching Trainium's 128-partition on-chip layout), using PSUM to cut SBUF traffic, swapping the loop order to reuse filter weights across images, and finally hinting a width tile of 64, which signaled the compiler to fuse operations across 128 partition tiles. That last change collapsed ~65,536 tensor_tensor calls and ~16,384 tensor_reduce calls down to 512 each, dropping those hotspots from ~70% of runtime to ~17% (e.g., tensor_tensor time fell from 3.84 ms to 0.2 ms). The case study demonstrates how targeted, architecture-aware compilation, tiling, and fusion strategies can dramatically accelerate convolution kernels on specialized accelerators, and it suggests similar gains are attainable for other ML primitives on Trainium.
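A minimal sketch of the restructured loop nest the summary describes, written in plain NumPy rather than NKI: channels tiled into 128-wide groups, the filter-tap loop hoisted so each weight slice is reused across all N images, and an inner width tile of 64. Only the tile sizes (128 and 64) come from the article; the function name and exact loop structure are assumptions for illustration, and SBUF/PSUM placement exists only in the real NKI kernel:

```python
import numpy as np

P_TILE, W_TILE = 128, 64  # 128-channel groups; width-tile hint of 64

def conv1d_depthwise_tiled(x, w):
    """Schedule-level sketch (assumed structure): same math, reordered loops."""
    N, C, W = x.shape
    K = w.shape[1]
    OW = W - K + 1
    out = np.zeros((N, C, OW), dtype=x.dtype)
    for c0 in range(0, C, P_TILE):                 # tile channels into 128-wide groups
        w_tile = w[c0:c0 + P_TILE]                 # fetch this group's weights once
        for k in range(K):                         # filter taps hoisted outward ...
            wk = w_tile[:, k][:, None]             # ... so each tap's weights are
            for n in range(N):                     # reused across all N images
                for t0 in range(0, OW, W_TILE):    # width tiles of 64
                    t1 = min(t0 + W_TILE, OW)
                    out[n, c0:c0 + P_TILE, t0:t1] += (
                        wk * x[n, c0:c0 + P_TILE, k + t0:k + t1]
                    )
    return out
```

This produces the same values as the reference version; only the iteration order and tiling change, which is exactly the kind of schedule-level rewrite the summary credits with unlocking compiler fusion.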