Show HN: Speculative Decoding from Scratch in PyTorch (2.8x CPU Speedup) (github.com)

🤖 AI Summary
A new open-source PyTorch implementation of speculative decoding promises 2–3× faster LLM inference on CPU by implementing the full speculative sampling pipeline (draft generation, parallel verification, rejection sampling) from scratch rather than relying on black-box libraries. The repo demonstrates a 2.83× speedup on an Intel Core Ultra 5 225H using OPT-125M as a fast draft model and OPT-1.3B as the target, while preserving exact distributional fidelity — the authors mathematically prove the output matches standard autoregressive sampling. This makes the work both a practical optimization and a transparent educational example for the community.

Technically, the engine generates γ speculative tokens with a small draft model, then verifies them in one batched forward pass of the target model. Each draft token is accepted with probability min(1, p_target/p_draft); rejected tokens are resampled from an adjusted target-based distribution so the final sequence distribution is exact.

Empirical tuning shows γ≈3–4 gives the best tradeoff (peak throughput at γ=3 in some tests; γ=4 often yields strong speedups), and acceptance rates depend on task predictability (higher for idioms and structured text, lower for creative generation). Key system constraints are parallel-verification bottlenecks and memory-bandwidth limits on CPU. The repo (PyTorch + transformers) is ready for benchmarking and further experiments with draft/target pairs, hardware profiles, and draft-training strategies.
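The accept/resample rule above can be sketched in a few lines of PyTorch. This is a minimal illustration of the verification step, not the repo's actual code: the function name, tensor shapes, and toy vocabulary are assumptions for the example. Given the draft model's distributions and sampled tokens for γ positions, plus the target model's distributions (including one extra position for the "bonus" token), it returns between 1 and γ+1 tokens:

```python
import torch

def speculative_step(draft_probs, target_probs, draft_tokens, generator=None):
    """One verification step of speculative sampling (illustrative sketch).

    draft_probs:  (gamma, V) draft-model distributions at each drafted position
    target_probs: (gamma+1, V) target-model distributions from one batched
                  forward pass (the extra row is for the bonus token)
    draft_tokens: (gamma,) tokens sampled from the draft model
    Returns a list of 1 to gamma+1 accepted/resampled token ids.
    """
    gamma, V = draft_probs.shape
    out = []
    for i in range(gamma):
        t = draft_tokens[i]
        p, q = target_probs[i, t], draft_probs[i, t]
        # Accept the draft token with probability min(1, p_target / p_draft).
        if torch.rand((), generator=generator) < torch.clamp(p / q, max=1.0):
            out.append(int(t))
        else:
            # On rejection, resample from the adjusted distribution
            # max(0, p_target - p_draft), renormalized; this is what makes
            # the overall output distribution exactly match the target model.
            adj = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            adj = adj / adj.sum()
            out.append(int(torch.multinomial(adj, 1, generator=generator)))
            return out  # later draft tokens are discarded
    # All gamma drafts accepted: take a free bonus token from the target's
    # extra position, so a fully accepted block yields gamma + 1 tokens.
    out.append(int(torch.multinomial(target_probs[gamma], 1, generator=generator)))
    return out
```

In a full decoding loop, this step repeats until the sequence is complete; the speedup comes from the target model scoring all γ draft positions in a single batched forward pass instead of γ sequential ones.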