Should GPUs Make Free Trade Agreements? (www.doubleword.ai)

🤖 AI Summary
Doubleword proposes applying the economic idea of comparative advantage to batched LLM inference: split the work across GPUs based on what each is relatively best at. LLM inference has two phases, the compute-bound (FLOP-heavy) prefill and the memory-bandwidth-bound decode. Rather than running identical stacks on every GPU, they suggest routing prefills to high-FLOP cards (e.g., H100s) and decodes to GPUs with relatively better bandwidth or cost efficiency (e.g., A100/A10). Using disaggregated prefill (NVIDIA Dynamo-style) across a mixed fleet could increase total token throughput and lower cost per token, especially for batch-oriented workloads like OpenAI's Batch API.

This matters because many cloud and enterprise fleets are heterogeneous, and simple per-GPU isolation wastes those comparative strengths. The expected trade-off is throughput gains at the cost of per-request latency and added system complexity: varying data-type support (fp8 on H100 vs. older cards), differing parallelization strategies, and the need to reshape and transfer KV caches between GPUs. Doubleword plans to test the idea empirically (and cites a recent paper illustrating related gains). Realizing these benefits requires inference engines that can handle heterogeneous precision, memory sizes, and KV-cache orchestration, but if solved, heterogeneous inference could be a practical performance and cost win.
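To make the comparative-advantage framing concrete, here is a minimal, hypothetical Python sketch of how a scheduler for a mixed fleet might pick a phase for each GPU: score prefill by FLOP/s per dollar and decode by memory bandwidth per dollar, then assign each card to the phase where its relative advantage is larger. The GPU figures are rough public spec-sheet numbers, the hourly prices and the routing rule are assumptions for illustration, and none of this is Doubleword's or NVIDIA Dynamo's actual mechanism.

```python
# Hypothetical sketch: a toy roofline-style role assignment for a
# heterogeneous inference fleet. Prefill is treated as compute-bound,
# decode as memory-bandwidth-bound; each GPU goes to the phase where
# its relative (per-dollar) advantage is larger.

from dataclasses import dataclass


@dataclass
class GPU:
    name: str
    flops_tps: float      # approx. dense fp16/bf16 TFLOP/s
    mem_bw_tbs: float     # approx. HBM bandwidth, TB/s
    cost_per_hour: float  # assumed rental price, USD/hour (illustrative)


# Rough public figures; exact numbers vary by SKU, clocks, and pricing.
FLEET = [
    GPU("H100-SXM", flops_tps=989.0, mem_bw_tbs=3.35, cost_per_hour=4.00),
    GPU("A100-80GB", flops_tps=312.0, mem_bw_tbs=2.00, cost_per_hour=2.00),
    GPU("A10", flops_tps=125.0, mem_bw_tbs=0.60, cost_per_hour=0.75),
]


def prefill_score(gpu: GPU) -> float:
    # Prefill throughput per dollar scales roughly with FLOP/s.
    return gpu.flops_tps / gpu.cost_per_hour


def decode_score(gpu: GPU) -> float:
    # Decode throughput per dollar scales roughly with how fast weights
    # and KV cache stream from HBM.
    return gpu.mem_bw_tbs / gpu.cost_per_hour


def assign_roles(fleet: list[GPU]) -> dict[str, str]:
    """Assign each GPU to the phase where its relative advantage is larger."""
    best_prefill = max(prefill_score(g) for g in fleet)
    best_decode = max(decode_score(g) for g in fleet)
    roles = {}
    for gpu in fleet:
        rel_prefill = prefill_score(gpu) / best_prefill
        rel_decode = decode_score(gpu) / best_decode
        roles[gpu.name] = "prefill" if rel_prefill >= rel_decode else "decode"
    return roles


if __name__ == "__main__":
    for name, role in assign_roles(FLEET).items():
        print(f"{name:>10} -> {role}")
```

With these illustrative numbers the sketch reproduces the article's intuition: the H100 lands on prefill while the A100 and A10 land on decode. A real system would also have to account for KV-cache transfer cost between the two pools, which this toy rule ignores.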