Systematically generating tests that would have caught Anthropic's top‑K bug (theorem.dev)

🤖 AI Summary
Researchers introduced an automated testing pipeline that uses "fractional proof decomposition" to generate targeted property-based unit tests capable of finding rare, production-critical bugs. They demonstrated it by reproducing Anthropic's approximate top-K TPU bug without using Anthropic's hand-written reproducer. The team encodes an end-to-end theorem as a Hypothesis property-based test (PBT), for example ∀prompt,k: LLM_top-1(prompt) ∈ LLM_top-k(prompt), and then recursively decomposes it into smaller, checkable subtheorems (e.g., max(approximate_top_k(arr,k)) == max(arr); logits are finite; token IDs align with logprob keys in vLLM). The generated unit tests run on Colab; they found the top-K exclusion bug after ~10M samples and an XLA/TPU excess-precision issue in seconds. Fractional proofs are "fractional" components of a brute-force correctness proof: by breaking an end-to-end property into PBTs that compose logically, the input space for each subtest becomes small enough to sample efficiently, while the composition of the subtests still covers the original theorem. This shifts testing cost from being proportional to bug rarity to roughly logarithmic in rarity, letting teams catch low-probability edge cases without infeasible compute. The approach can be extended to real codebases (e.g., libtpu implementations or TPU→cluster composition) and supports automating model-based reasoning about program correctness for earlier, cost-effective bug detection.
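To make the subtheorem max(approximate_top_k(arr,k)) == max(arr) concrete, here is a minimal sketch of what such a property check could look like. It is not the authors' pipeline: the `approximate_top_k` below is a hypothetical bucketed stand-in (not the XLA/TPU operator), `buggy_top_k` and `find_counterexample` are illustrative names, and the sampling loop uses only Python's standard `random` module in place of Hypothesis so that it runs without dependencies.

```python
import random


def approximate_top_k(arr, k):
    """Hypothetical stand-in: split arr into k contiguous buckets and
    keep the maximum of each bucket (an approximation of top-k)."""
    n = len(arr)
    k = min(k, n)  # never ask for more buckets than elements
    bounds = [(i * n) // k for i in range(k + 1)]
    return [max(arr[bounds[i]:bounds[i + 1]]) for i in range(k)]


def buggy_top_k(arr, k):
    """Illustrative bug: silently ignores the last element when possible."""
    return approximate_top_k(arr[:-1] or arr, k)


def find_counterexample(top_k_fn, trials=1000, seed=0):
    """Sample random (arr, k) pairs and check the subtheorem
    max(top_k_fn(arr, k)) == max(arr).
    Returns the first failing (arr, k), or None if no sample fails."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 64)
        arr = [rng.uniform(-1e3, 1e3) for _ in range(n)]
        k = rng.randint(1, n)
        if max(top_k_fn(arr, k)) != max(arr):
            return arr, k
    return None
```

Calling `find_counterexample(approximate_top_k)` returns `None` for the stand-in, because the global maximum always lies in some bucket. Passing `buggy_top_k` instead fails whenever the maximum sits in the last position, for example on `[1.0, 2.0, 3.0]`. Because the inputs are small arrays rather than full prompts, each sample is cheap, which is the property the decomposition relies on.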