🤖 AI Summary
This post is a self-contained diagnostic script for investigating mismatches between "engine" (vLLM or sglang) token log-probabilities and those computed directly from a Hugging Face (HF) transformers model, a common brittle point in RL workflows such as PPO or importance-weighted updates. It automates generation via vLLM/sglang, extracts the engine logprobs, runs an HF forward pass with a sharded/offloaded base transformer, offloads hidden states to CPU, and recomputes the chosen-token log-probs via chunked head evaluation. It also includes plotting helpers (2D density, histograms, per-token Δprob) to visualize divergence in linear or log space.
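A minimal sketch of the chunked head evaluation step, assuming hypothetical names (`hidden` for the CPU-offloaded final hidden states, `token_ids` for the engine-chosen tokens, `lm_head_w` for the lm_head weight already resident on the head device); this illustrates the technique rather than reproducing the script's actual code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def chunked_chosen_logprobs(hidden, token_ids, lm_head_w, head_device, chunk=1024):
    """Recompute log-probs of the engine-chosen tokens in chunks.

    hidden:    [T, D] final hidden states, offloaded to (pinned) CPU memory
    token_ids: [T] tokens actually emitted by the engine
    lm_head_w: [V, D] lm_head weight on head_device
    """
    out = torch.empty(hidden.shape[0], dtype=torch.float32)
    for s in range(0, hidden.shape[0], chunk):
        # Blocking CPU -> CUDA copy for a deterministic transfer.
        h = hidden[s:s + chunk].to(head_device, non_blocking=False)
        logits = F.linear(h, lm_head_w).float()  # [c, V], cast to float32
        # Log-sum-exp with max subtraction for numerical stability.
        m = logits.max(dim=-1, keepdim=True).values
        lse = m + (logits - m).exp().sum(dim=-1, keepdim=True).log()
        logprobs = logits - lse
        ids = token_ids[s:s + chunk].to(head_device)
        out[s:s + chunk] = logprobs.gather(-1, ids.unsqueeze(-1)).squeeze(-1).cpu()
    return out
```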
Technically, the script emphasizes numerical and device correctness. It computes logits with F.linear(h, W), where W is the lm_head weight on the head device; applies the log-sum-exp trick (max subtraction) and casts intermediates to float32 for stability; calls torch.cuda.set_device to avoid cross-device kernel launches; uses pinned CPU memory and blocking CPU↔CUDA copies for deterministic transfers; and backs off the chunk size on OOM. It supports bfloat16, FlashAttention2, and device_map offloading, and it reproduces engine sampling (noting vLLM's caveat about logprobs versus sampling). For the AI/ML community this is a practical, reproducible tool for root-causing subtle logprob inconsistencies that can silently corrupt RL training and evaluation.
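The OOM backoff could be wrapped as a small retry helper along these lines (the function name and default sizes are assumptions, not the post's code): halve the chunk size on CUDA OOM and try again.

```python
import torch

def with_oom_backoff(fn, chunk=4096, min_chunk=64):
    """Call fn(chunk), halving the chunk size on CUDA OOM until it fits."""
    while chunk >= min_chunk:
        try:
            return fn(chunk)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            chunk //= 2
    raise RuntimeError("OOM even at the minimum chunk size")
```

Combined with the sketch above, usage would look like `with_oom_backoff(lambda c: chunked_chosen_logprobs(hidden, token_ids, lm_head_w, head_device, chunk=c))`.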