Measuring FPGA vs ARM on Pynq-Z2: Tiny MLP, huge AXI/DMA Overhead (github.com)

🤖 AI Summary
A hands-on lab notebook measured where latency really goes when you move a tiny quantized MLP into a Zynq-based Pynq‑Z2 datapath. The author built two lanes on the same SoC — an ARM “reflex” lane (Python/C on the Cortex‑A9) and an FPGA “neuro” lane (traffic generator → feature pipeline → 4→32→1-ish MLP with weights in BRAM) — and instrumented everything with on‑chip timers.

The MLP math itself is tiny (~64 cycles ≈ 0.5 µs at 125 MHz), but the surrounding fabric/shell dominates: a roughly 140k‑cycle shell (1.0–1.3 ms with no DMA) plus expensive S2MM DMA pushes the full FPGA lane to ~3.4–3.7 ms. By contrast, the ARM reflex path is ~16–20 µs (p50 ≈ 17 µs, p99 ≈ 39 µs), so on this SoC the CPU is ~100× faster for this workload. The repo includes four overlays (Full, MLP‑only, No‑DMA, Core‑probe) and scripts that produce the CDFs and per‑iteration traces that make these breakdowns explicit.

The takeaway for ML/FPGA practitioners: the compute can be essentially free, but generic AXI infrastructure — width converters, FIFOs, PL/PS handshakes, and DMA/IP glue — imposes huge cycle costs on teaching boards. That’s not a bug; it’s how Zynq and stock IP are designed for flexibility. Real low‑latency NIC FPGAs use hardened MACs, much higher clocks (400–800 MHz), and extremely thin streaming shells or sideband control, so the hot path is combinational/state‑machine sized rather than a generic S2MM round trip. If you care about microsecond latency, budget cycles for the shell and data movement first — shaving MLP cycles alone won’t get you there.
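The cycle-to-time arithmetic behind these numbers is worth making explicit. A minimal sketch, using the 125 MHz fabric clock and the cycle counts quoted in the summary (the helper name is illustrative, not from the repo):

```python
# Sanity-check the cycle budgets quoted above at the stated 125 MHz PL clock.
F_CLK_HZ = 125e6  # fabric clock frequency from the write-up

def cycles_to_us(cycles: int, f_hz: float = F_CLK_HZ) -> float:
    """Convert a cycle count at clock f_hz to microseconds."""
    return cycles / f_hz * 1e6

mlp_core_us = cycles_to_us(64)       # the MLP math itself
shell_us = cycles_to_us(140_000)     # the AXI/shell overhead, no-DMA case

print(f"MLP core: {mlp_core_us:.3f} us")   # ~0.5 us, as quoted
print(f"Shell:    {shell_us / 1000:.2f} ms")  # ~1.12 ms, inside the 1.0-1.3 ms range
```

The ratio is the article's whole point: the shell alone costs on the order of 2000× the core compute, so optimizing the 64-cycle MLP is irrelevant until the shell shrinks.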
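The p50/p99 figures for the ARM lane come from per-iteration traces. A hedged sketch of how such percentiles can be gathered on the PS side — the workload below is a placeholder stand-in, not the repo's actual quantized MLP, and `measure_latency_us` is an illustrative name:

```python
import time

def measure_latency_us(fn, iters: int = 10_000):
    """Time fn() per iteration and return (p50, p99) latency in microseconds."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - t0) / 1000.0)  # ns -> us
    samples.sort()
    p50 = samples[len(samples) // 2]
    p99 = samples[min(int(iters * 0.99), iters - 1)]
    return p50, p99

# Placeholder "reflex" workload: a tiny integer dot product (hypothetical weights).
WEIGHTS = [3, -1, 4, -1]

def tiny_step(x=(1, 2, 3, 4)):
    return sum(w * xi for w, xi in zip(WEIGHTS, x))

p50, p99 = measure_latency_us(tiny_step)
print(f"p50={p50:.1f} us  p99={p99:.1f} us")
```

Plotting the sorted samples against their rank fraction yields exactly the kind of latency CDF the repo's scripts produce.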