StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs (arxiv.org)

🤖 AI Summary
StreamTensor is a compiler framework that makes tensors stream continuously between kernels on dataflow accelerators, targeting LLM inference bottlenecks. Rather than relying on bulk reads and writes to external memory, it encodes explicit stream layouts via a novel iterative tensor type system, which enables automated kernel fusion, buffer allocation, and memory-access optimization. The compiler systematically searches three hierarchical design spaces (tensor tiling, kernel fusion, and resource allocation) to balance computational intensity, on-chip buffering, and streaming bandwidth for throughput and memory efficiency. On FPGA LLM benchmarks, StreamTensor reduces latency to as low as 0.76× that of a state-of-the-art FPGA LLM accelerator (≈24% reduction) and 0.64× that of GPUs (≈36% reduction), while delivering up to 1.99× higher energy efficiency than GPUs. These gains show that compiler-driven streaming with explicit stream types can significantly reduce external memory pressure, improve inter-kernel data reuse, and unlock more efficient dataflow implementations of large models. The approach is immediately relevant to accelerator designers and compiler engineers scaling LLM inference on resource-constrained, energy-sensitive hardware.
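To make the stream-layout idea concrete, here is a minimal Python sketch, not the paper's actual MLIR-based implementation. All names (StreamLayout, layouts_match, fifo_depth) are hypothetical. It illustrates how an explicit stream type lets a compiler decide whether two kernels can be fused over an on-chip stream and how deep the connecting buffer must be:

```python
from dataclasses import dataclass

@dataclass
class StreamLayout:
    """Hypothetical stand-in for the paper's iterative tensor type: it records
    the order and granularity in which a kernel produces or consumes elements."""
    dims: tuple[str, ...]   # loop-nest order, outermost first, e.g. ("i", "j")
    tiles: dict[str, int]   # tile size streamed per dimension

def layouts_match(producer: StreamLayout, consumer: StreamLayout) -> bool:
    """Kernels can be fused into a direct on-chip stream only when the producer
    emits elements in exactly the order and tiling the consumer expects."""
    return producer.dims == consumer.dims and producer.tiles == consumer.tiles

def fifo_depth(producer: StreamLayout, consumer: StreamLayout) -> int:
    """Toy buffer-sizing rule (an assumption, not the paper's algorithm):
    matching layouts need only a shallow FIFO for double buffering; mismatched
    tilings need a reorder buffer sized by the disagreeing tile extents."""
    if layouts_match(producer, consumer):
        return 2
    depth = 1
    for dim in set(producer.dims) | set(consumer.dims):
        p, c = producer.tiles.get(dim, 1), consumer.tiles.get(dim, 1)
        if p != c:
            depth *= max(p, c)
    return depth

# A matmul streaming 8x8 output tiles row-major into an elementwise GELU kernel:
matmul_out = StreamLayout(dims=("i", "j"), tiles={"i": 8, "j": 8})
gelu_in = StreamLayout(dims=("i", "j"), tiles={"i": 8, "j": 8})
print(layouts_match(matmul_out, gelu_in))   # True -> fuse over a FIFO
print(fifo_depth(matmul_out, gelu_in))      # 2, i.e. no external-memory round trip
```

In the actual framework this reasoning happens over typed IR at compile time rather than runtime objects, and buffer sizing is one axis of the resource-allocation design space the compiler searches; the rule above is a deliberately simplified stand-in.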