LLM Inference Economics from First Principles (www.tensoreconomics.com)

🤖 AI Summary
This analysis builds an end-to-end, first-principles model of LLM inference economics, using Llama 3.3 70B as a concrete example, to show why the GPU is the dominant cost and how that cost maps to price per token. The core pricing intuition is simple: token unit cost ≈ GPU cost per hour ÷ tokens produced per hour. For Llama 3.3 70B (≈70.55 billion parameters stored in bfloat16), the weights alone occupy ~141 GB, more than fits on a single A100 or H100, so production deployments typically require 4–8 GPUs per model instance. That hardware footprint, together with achievable token throughput, determines provider margins and how cheaply labs can generate synthetic data or democratize access.

Technically, LLM inference alternates between a compute-bound prefill (prompt-processing) phase and a heavily memory-bound decoding phase, so arithmetic intensity (FLOPs per byte moved) controls utilization. GPUs like the A100 (3.12×10^14 FLOPS, 2.03×10^12 B/s HBM bandwidth) and H100 (9.89×10^14 FLOPS, 3.35×10^12 B/s) have very different compute and bandwidth ceilings, but real LLM workloads are often limited by memory bandwidth rather than compute.

Practical points: matrix-multiply FLOPs are approximated as 2mno for an (m×n)·(n×o) product; Llama uses Grouped-Query Attention (the K and V projections are 1/8 the size of Q), which changes parameter counts and compute shapes; and RoPE adds a small per-element FLOP cost. The main optimization target for improving the economics is therefore raising arithmetic intensity in the memory-bound decode phase to better exploit GPU FLOPS and lower cost per token. The sketches below work through this arithmetic.
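To make the pricing intuition concrete, here is a minimal back-of-the-envelope sketch in Python. The parameter count, bytes per parameter, and GPU memory figures come from the summary above; the hourly GPU price, GPU count, and aggregate throughput are illustrative assumptions, not numbers from the article.

```python
import math

# Back-of-the-envelope inference economics for Llama 3.3 70B.
# Model and hardware figures follow the summary; the price and throughput
# numbers below are illustrative assumptions, not values from the article.

PARAMS = 70.55e9            # Llama 3.3 70B parameter count
BYTES_PER_PARAM = 2         # bfloat16
GPU_HBM_BYTES = 80e9        # 80 GB A100/H100 card

weight_bytes = PARAMS * BYTES_PER_PARAM
print(f"Weights: {weight_bytes / 1e9:.0f} GB")          # ~141 GB

# Weights alone need at least 2 cards; KV cache and activation memory
# push real deployments to 4-8 GPUs per model instance.
min_gpus_for_weights = math.ceil(weight_bytes / GPU_HBM_BYTES)
print(f"Minimum GPUs just to hold weights: {min_gpus_for_weights}")

# Core pricing identity: cost per token ~= GPU cost per hour / tokens per hour.
GPU_PRICE_PER_HOUR = 2.0    # USD per GPU-hour (assumption)
N_GPUS = 4                  # GPUs per model instance (assumption)
TOKENS_PER_SECOND = 2500    # aggregate decode throughput (assumption)

cost_per_hour = GPU_PRICE_PER_HOUR * N_GPUS
tokens_per_hour = TOKENS_PER_SECOND * 3600
cost_per_million_tokens = cost_per_hour / tokens_per_hour * 1e6
print(f"Cost: ${cost_per_million_tokens:.2f} per million output tokens")
```

Everything upstream of the final division is just plugging in the identity from the summary; the interesting economics live in how high TOKENS_PER_SECOND can be pushed on a fixed set of GPUs.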
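The compute-bound versus memory-bound distinction can also be sketched numerically. Using the A100 and H100 peak figures quoted above, the "ridge" arithmetic intensity (FLOPs per byte at which a GPU stops being bandwidth-limited) is roughly 150–300 FLOPs/byte, while batch-1 decoding, which reads every weight once per generated token, sits far below that. The matmul FLOP count uses the 2mno approximation from the summary; the matrix shapes are illustrative assumptions.

```python
# Arithmetic intensity sketch: why decoding is memory-bound.
# Peak FLOPS and HBM bandwidth are the figures quoted in the summary;
# the matmul shapes below are illustrative assumptions.

GPUS = {
    "A100": {"flops": 3.12e14, "bandwidth": 2.03e12},   # bf16 FLOPS, bytes/s
    "H100": {"flops": 9.89e14, "bandwidth": 3.35e12},
}

def matmul_intensity(m: int, n: int, o: int, bytes_per_el: int = 2) -> float:
    """Arithmetic intensity of an (m x n) @ (n x o) matmul in bf16.

    FLOPs ~= 2*m*n*o; bytes moved ~= inputs plus output, each touched once.
    """
    flops = 2 * m * n * o
    bytes_moved = bytes_per_el * (m * n + n * o + m * o)
    return flops / bytes_moved

for name, spec in GPUS.items():
    ridge = spec["flops"] / spec["bandwidth"]   # FLOPs/byte needed to saturate compute
    print(f"{name}: ridge point ~{ridge:.0f} FLOPs/byte")

# Prefill: many prompt tokens hit the weights at once (large m), so intensity
# lands well above the ridge point and the GPU is compute-bound.
print(f"prefill-like (m=4096): {matmul_intensity(4096, 8192, 8192):.0f} FLOPs/byte")

# Decode at batch size 1: one token per weight read (m=1) gives intensity ~1,
# far below the ridge point, so throughput is set by memory bandwidth.
print(f"decode-like (m=1):     {matmul_intensity(1, 8192, 8192):.2f} FLOPs/byte")
```

Raising decode-side arithmetic intensity (larger batches, better KV-cache handling) is exactly the "main optimization target" the summary points to.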
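The Grouped-Query Attention point can be checked with a quick shape calculation. The 1/8 ratio comes from the summary; the hidden size and head counts below (8192, 64 query heads, 8 KV heads, head dimension 128) are the commonly cited Llama 70B configuration and should be treated as assumptions here.

```python
# Grouped-Query Attention (GQA) projection shapes for a Llama-70B-like layer.
# Hidden size and head counts are assumed (commonly cited Llama 70B config);
# the summary itself only states that K/V are 1/8 the size of Q.

HIDDEN = 8192
N_Q_HEADS = 64
N_KV_HEADS = 8
HEAD_DIM = HIDDEN // N_Q_HEADS              # 128

q_proj = HIDDEN * (N_Q_HEADS * HEAD_DIM)    # 8192 x 8192 weight matrix
k_proj = HIDDEN * (N_KV_HEADS * HEAD_DIM)   # 8192 x 1024 weight matrix
v_proj = k_proj

print(f"Q projection params: {q_proj / 1e6:.1f}M")
print(f"K projection params: {k_proj / 1e6:.1f}M (K/Q ratio = {k_proj / q_proj:.3f})")
# K/Q ratio = 8/64 = 1/8, matching the summary; the smaller K/V projections
# also shrink the KV cache, which matters for the memory-bound decode phase.
```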