🤖 AI Summary
Agentic LLM workloads—agents that browse web pages, call tools, or track long tool-call trajectories—drive far larger context windows than chatbots and thus create extreme off‑chip memory traffic that starves on‑chip compute behind two “memory walls”: bandwidth and capacity. To address this, researchers introduced PLENA, a hardware–software co‑designed system that optimizes long‑context inference serving via three core pathways: reduced memory footprint through asymmetric quantization, a flattened systolic array with native FlashAttention support to reshape data movement, and a complete software stack (custom ISA, compiler, cycle‑accurate simulator and automated design‑space exploration) to map agent workloads efficiently.
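To make the first pathway concrete, here is a minimal sketch of per-tensor asymmetric quantization as it might be applied to a block of KV cache. The bit width, function names, and per-tensor granularity are illustrative assumptions for this sketch, not details taken from PLENA's design.

```python
import numpy as np

def asymmetric_quantize(x: np.ndarray, bits: int = 4):
    """Quantize a tensor to unsigned integers with a per-tensor scale and zero point.

    Asymmetric quantization maps [min(x), max(x)] onto [0, 2**bits - 1], so the
    grid covers the tensor's actual range instead of a zero-centered symmetric one.
    """
    qmax = (1 << bits) - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / qmax if x_max > x_min else 1.0
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximate float tensor from its quantized form."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: a 4-bit KV-cache block needs ~8x less storage and bandwidth than fp32,
# plus a small per-tensor overhead for the scale and zero point.
kv_block = np.random.randn(128, 64).astype(np.float32)
q, scale, zp = asymmetric_quantize(kv_block, bits=4)
approx = dequantize(q, scale, zp)
print("max abs error:", np.abs(kv_block - approx).max())
```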
Technically, PLENA's architecture minimizes off‑chip transfers and improves on‑chip utilization by aligning its compute primitives with the attention patterns common in long contexts: the flattened systolic array accelerates memory‑efficient FlashAttention kernels, while asymmetric quantization cuts storage and bandwidth requirements with less accuracy loss than a symmetric scheme at the same bit width. In simulation, PLENA achieves up to 8.5× higher utilization than existing accelerators and delivers 2.24× and 3.85× higher throughput than an A100 GPU and a TPU v6e respectively, under equal multiplier counts and memory budgets. The full stack will be open‑sourced, making PLENA a practical blueprint for building inference hardware that scales agentic LLMs with long contexts.
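The summary does not describe the kernel itself, but the memory-efficient pattern such an array targets is FlashAttention-style tiled attention with an online softmax, sketched below in NumPy. The block size and function names are illustrative assumptions, and a real accelerator would operate on quantized tiles rather than float arrays.

```python
import numpy as np

def flash_attention(q, k, v, block_size=64):
    """Tiled attention with an online softmax, in the style of FlashAttention.

    The full (seq_len x seq_len) score matrix is never materialized; K and V are
    streamed in blocks and the softmax normalizer is updated incrementally.
    """
    seq_len, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(seq_len, -np.inf)   # running max for numerical stability
    row_sum = np.zeros(seq_len)           # running softmax normalizer

    for start in range(0, seq_len, block_size):
        kb = k[start:start + block_size]          # (B, d) block of keys
        vb = v[start:start + block_size]          # (B, d) block of values
        scores = (q @ kb.T) * scale               # (seq_len, B) partial scores

        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale previously accumulated partials
        p = np.exp(scores - new_max[:, None])

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]

# Check against a naive reference implementation.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (weights / weights.sum(axis=1, keepdims=True)) @ v
assert np.allclose(flash_attention(q, k, v), ref)
```

Because K and V are consumed one block at a time and the running max and normalizer are updated in place, the full score matrix never has to leave on-chip memory, which is the property that keeps utilization high as contexts grow.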