🤖 AI Summary
The author extended a simple batch-invariance check from the Thinking Machines post into a rigorous property-based test using Hypothesis (repo: pbt-batch-invariance). Instead of fixed-size random draws, Hypothesis generates varied tensor shapes, random float contents, and arbitrary row slices m:n, and shrinks counterexamples to minimal failing cases. The tested property is that slicing rows before an operation gives the same result as slicing its output: for matmul, a[m:n] @ b == (a @ b)[m:n]; for RMSNorm, implemented as x * torch.rsqrt(torch.mean(x**2, dim=-1, keepdim=True)) * gamma; and for scaled dot-product attention via torch.nn.functional.scaled_dot_product_attention. "Batched" implementations process the whole batch in a single call (where the kernel's reduction strategy may depend on batch size), while "rowwise" versions compute each row separately and stack the results (correct but slower).
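A minimal sketch of what such a Hypothesis property might look like for the matmul case (the names, strategies, and size bounds here are assumptions for illustration, not the repo's actual code; tensor contents are drawn with torch.randn for brevity rather than Hypothesis-generated floats, which would shrink better):

```python
# Hypothetical sketch of a Hypothesis property for matmul batch invariance.
# Not the repo's actual test; names and bounds are illustrative assumptions.
import torch
from hypothesis import given, settings, strategies as st


@st.composite
def matmul_case(draw):
    # Draw matrix dimensions and a row slice m:n within the batch.
    rows = draw(st.integers(min_value=2, max_value=16))
    k = draw(st.integers(min_value=1, max_value=32))
    cols = draw(st.integers(min_value=1, max_value=32))
    m = draw(st.integers(min_value=0, max_value=rows - 1))
    n = draw(st.integers(min_value=m + 1, max_value=rows))
    a = torch.randn(rows, k)
    b = torch.randn(k, cols)
    return a, b, m, n


@given(matmul_case())
@settings(max_examples=200, deadline=None)
def test_matmul_batch_invariance(case):
    a, b, m, n = case
    sliced_then_matmul = a[m:n] @ b
    matmul_then_sliced = (a @ b)[m:n]
    # Batch invariance demands bitwise equality, not torch.allclose.
    assert torch.equal(sliced_then_matmul, matmul_then_sliced)
```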
The tests found that rowwise implementations consistently pass, while batched kernels can violate batch invariance: on CPU only batched matmul failed, but on GPU all three batched versions (matmul, RMSNorm, attention) failed while their rowwise counterparts passed. Hypothesis-produced counterexamples (saved in test_outputs) demonstrate real kernel-level nondeterminism tied to how reductions and batching are implemented on accelerators. The implication is that many high-performance kernels can introduce batch-dependent numerical differences that break deterministic inference for LLMs; the practical remedies are either batch-invariant kernel designs or per-row computation (at a performance cost), and property-based testing is a powerful tool for discovering and minimizing such bugs.
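To make the batched/rowwise distinction concrete, here is an illustrative sketch of the two styles for matmul (these are generic implementations under the summary's description, not the repo's code):

```python
# Illustrative batched vs. rowwise matmul, per the distinction above.
import torch


def matmul_batched(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # One kernel call over the whole batch; the kernel's tiling/reduction
    # strategy can depend on the number of rows, which is where
    # batch-dependent numerical differences can creep in.
    return a @ b


def matmul_rowwise(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Each row is computed independently and the results are stacked:
    # slower, but a given row's arithmetic never depends on the other rows,
    # so slicing before or after the op gives bitwise-identical results.
    return torch.stack([row @ b for row in a])
```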