Wafer-Scale AI Compute: A System Software Perspective (www.sigops.org)

🤖 AI Summary
AI compute is moving beyond multi-chip GPUs to wafer-scale chips that put hundreds of thousands to millions of cores and tens of gigabytes of SRAM onto a single silicon wafer. The article frames this transition as driven by AI scaling laws and enabled by advances in packaging and process nodes (TSMC/industry interest), and it shows a working system, WaferLLM, that achieves sub-millisecond-per-token inference, demonstrating that wafer-scale integration can deliver dramatic test-time efficiency. This matters because wafer-scale designs can reduce off-chip communication energy and latency by 10–100×, offer orders-of-magnitude increases in on-chip bandwidth, and enable tighter coupling of compute and memory for very large models.
From a systems-software perspective, the paper introduces PLMR (Parallelism, non-uniform Latency, constrained per-core Memory, constrained Routing) as a compact model of the hardware-software constraints developers must address. Wafer-scale chips expose mesh NoCs and require asynchronous message passing, since shared memory and global synchronization break down at million-core scale, and they force data and compute to be partitioned into tiny local stores. Practical limits include per-core local memory in the KB–MB range, remote accesses roughly 1,000× slower across many hops, and router header/addressing limits. As a result, compilers, runtimes, and programming models must be rethought: minimize long-range traffic, favor locality and fixed neighbor sets, overlap compute with communication, and adopt PLMR-aware scheduling and data placement. The takeaway: hardware enables huge gains, but realizing them requires new distributed-memory-aware stacks, tooling, and compiler/runtime co-design.
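To make the PLMR constraints concrete, here is a minimal, hypothetical sketch of the programming pattern the summary describes: each core owns a tiny local store, exchanges halo values only with a fixed set of mesh neighbors via asynchronous messages, and overlaps local compute with communication instead of relying on shared memory or global barriers. This is not code from the article, WaferLLM, or any vendor toolchain; the 4×4 mesh size, the `core()` worker, and the mailbox-based messaging (simulated with Python threads and queues) are illustrative assumptions.

```python
import threading
import queue

MESH = 4           # toy 4x4 mesh; real wafers have on the order of 10^5-10^6 cores
LOCAL_WORDS = 64   # tiny per-core local store (KB-MB per core in practice)
STEPS = 3

# One mailbox per core: asynchronous message passing, no shared memory.
mailboxes = {(r, c): queue.Queue() for r in range(MESH) for c in range(MESH)}

def neighbors(r, c):
    """Fixed neighbor set: only the 4 adjacent cores on the mesh NoC."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < MESH and 0 <= nc < MESH:
            yield (nr, nc)

def core(r, c, results):
    """One 'core': local data only, neighbor-only halo exchange, no global barrier."""
    local = [float(r * MESH + c)] * LOCAL_WORDS   # data partitioned into the local store
    pending = []                                  # buffer for messages that arrive early
    nbrs = list(neighbors(r, c))
    for step in range(STEPS):
        # 1. Post halo sends to the fixed neighbor set (fire-and-forget).
        for nbr in nbrs:
            mailboxes[nbr].put((step, local[0]))
        # 2. Do purely local compute while neighbor messages are in flight
        #    (conceptual compute/communication overlap).
        acc = sum(local) / LOCAL_WORDS
        # 3. Collect this step's halo values from the local mailbox only.
        received = [v for (s, v) in pending if s == step]
        pending = [(s, v) for (s, v) in pending if s != step]
        while len(received) < len(nbrs):
            s, v = mailboxes[(r, c)].get()
            if s == step:
                received.append(v)
            else:
                pending.append((s, v))            # a faster neighbor is a step ahead
        # 4. Update local state only; no remote loads, no global synchronization.
        local = [0.5 * acc + 0.5 * sum(received) / len(nbrs)] * LOCAL_WORDS
    results[(r, c)] = local[0]

results = {}
threads = [threading.Thread(target=core, args=(r, c, results))
           for r in range(MESH) for c in range(MESH)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items())[:4])   # a few per-core results after the relaxation steps
```

On real wafer-scale hardware the same shape would appear as statically routed channels and on-chip DMA rather than Python queues; the sketch only illustrates the PLMR discipline the summary calls for: local data placement, fixed neighbor sets, asynchronous halo exchange, and no global barrier.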