Analog in-memory computing attention mechanism for fast and energy-efficient LLMs (www.nature.com)

🤖 AI Summary
A new breakthrough in analog in-memory computing (IMC) offers a fast, energy-efficient hardware solution for the self-attention mechanism central to large language models (LLMs). Traditional GPUs face significant latency and energy bottlenecks because cached token projections must be repeatedly moved between memory and compute units during sequence generation. This work introduces a custom IMC architecture built on emerging charge-based gain cells that both store the KV-cache projections and perform the attention dot products in parallel in the analog domain, reducing latency by up to two orders of magnitude and energy consumption by up to four orders of magnitude compared to GPUs.

Key technical innovations include exploiting the gain cells' fast write speeds, multi-level storage, and non-destructive reads to handle dynamic token caching efficiently. The architecture circumvents analog non-idealities through a novel initialization algorithm that adapts pre-trained models such as GPT-2 to the hardware without retraining while maintaining competitive accuracy. In addition, the core attention operations run fully in the analog domain using charge-to-pulse circuits, avoiding power-hungry analog-to-digital converters and enabling scalable, low-latency inference. By combining analog IMC with software-hardware co-optimization, this work marks a significant step toward ultrafast, low-power transformers and addresses a major bottleneck in deploying large-scale generative AI models.
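To make the idea concrete, here is a minimal NumPy sketch (not the paper's implementation) of how storing KV-cache projections in multi-level analog cells and computing the attention dot products "in memory" might affect the result. The bit depth, read-noise level, and the use of a standard softmax are illustrative assumptions; the actual charge-to-pulse circuits and hardware-adapted nonlinearities are not modeled here.

```python
# Hedged sketch: compare ideal digital attention against an analog-IMC-style
# path with quantized (multi-level) KV storage and noisy in-memory dot products.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, levels=16):
    """Map values onto a limited number of storage levels (multi-level cell)."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

def analog_dot(a, B, read_noise=0.01):
    """Dot products computed 'in memory': ideal MAC plus additive read noise."""
    out = a @ B.T
    return out + rng.normal(0.0, read_noise * np.abs(out).max(), out.shape)

d, seq_len = 64, 128
q = rng.normal(size=d)                 # current query projection
K = rng.normal(size=(seq_len, d))      # cached key projections
V = rng.normal(size=(seq_len, d))      # cached value projections

# Digital reference attention
scores_ref = (q @ K.T) / np.sqrt(d)
attn_ref = np.exp(scores_ref - scores_ref.max())
attn_ref /= attn_ref.sum()
out_ref = attn_ref @ V

# Analog-IMC-style path: quantized storage, two noisy in-memory MAC stages
K_cell, V_cell = quantize(K), quantize(V)
scores = analog_dot(q, K_cell) / np.sqrt(d)
attn = np.exp(scores - scores.max())
attn /= attn.sum()
out = analog_dot(attn, V_cell.T)       # second analog stage: attention @ V

print("relative output error:",
      np.linalg.norm(out - out_ref) / np.linalg.norm(out_ref))
```

The point of the sketch is that both attention matrix-vector products operate directly on the stored (quantized, noisy) cache, which is where the latency and energy savings come from; the hardware-adaptation step described in the paper would further compensate for such non-idealities without retraining.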