🤖 AI Summary
RevenueCat ran simulations to test the common claim that Bayesian A/B tests are immune to “peeking” (stopping as soon as the posterior crosses a threshold). They simulated two arms with identical Bernoulli conversion rates (r ∈ {0.1%, 1%, 10%}), started from an uninformative Beta(1,1) prior, and checked the posterior every N observations (N ∈ {10^2, 10^3, 10^4, 10^5, 10^6}). With a stopping rule that declares a winner as soon as P(B>A) > 0.95, false positive rates rose sharply as peeking became more frequent: checking every 100 observations produced ~80% false positives even though the arms were identical.
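Below is a minimal Python sketch of this kind of peeking simulation, written from the description above rather than from RevenueCat's actual code. The horizon (20,000 observations per arm), the number of repeated experiments, and the Monte Carlo draw count are illustrative assumptions, so the exact numbers will differ from the post's, but the inflation of the false positive rate is the same effect.

```python
import numpy as np

# Sketch of the peeking simulation described above (parameters are assumptions,
# not taken from RevenueCat's post).
rng = np.random.default_rng(0)

def prob_b_beats_a(succ_a, fail_a, succ_b, fail_b, draws=4_000):
    """Monte Carlo estimate of P(B > A) under independent Beta(1,1) priors."""
    post_a = rng.beta(1 + succ_a, 1 + fail_a, draws)
    post_b = rng.beta(1 + succ_b, 1 + fail_b, draws)
    return np.mean(post_b > post_a)

def peeking_experiment(rate=0.01, peek_every=100, max_n=20_000, threshold=0.95):
    """Both arms share the same true rate; return True if B is (falsely) declared the winner."""
    succ_a = fail_a = succ_b = fail_b = 0
    for _ in range(max_n // peek_every):
        batch_a = rng.random(peek_every) < rate   # identical Bernoulli(rate) arms
        batch_b = rng.random(peek_every) < rate
        succ_a += batch_a.sum(); fail_a += peek_every - batch_a.sum()
        succ_b += batch_b.sum(); fail_b += peek_every - batch_b.sum()
        if prob_b_beats_a(succ_a, fail_a, succ_b, fail_b) > threshold:
            return True  # stop as soon as the posterior crosses the threshold
    return False

runs = 200
false_positives = sum(peeking_experiment() for _ in range(runs))
print(f"False positive rate when peeking every 100 observations: {false_positives / runs:.0%}")
```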
The takeaway for the AI/ML community: Bayesian posteriors remain interpretable at any sample size (you can read P(B>A) meaningfully whenever you look), but treating a fixed posterior threshold as a frequentist test and stopping on success inflates Type I error just like optional stopping in classical tests. If you need to control frequentist error under continuous monitoring, use explicit sequential methods or pre-specified stopping rules (group-sequential designs, adjusted thresholds, Bayes factors or decision-theoretic criteria, or other calibration) rather than relying on unconstrained peeking with a 0.95 posterior cutoff.
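As a purely illustrative follow-up (not a method from the post), one crude form of calibration is to fix the monitoring schedule in advance and then simulate identical arms to find a posterior cutoff whose realized false positive rate is acceptable. Every parameter in the sketch below (rate, schedule, run and draw counts) is an assumption.

```python
import numpy as np

# Hypothetical calibration sketch: pre-specify the peeking schedule, then raise
# the posterior cutoff until the simulated false positive rate under identical
# arms drops to a tolerable level.
rng = np.random.default_rng(1)

def false_positive_rate(threshold, rate=0.01, peek_every=1_000, max_n=20_000,
                        runs=300, draws=4_000):
    hits = 0
    for _ in range(runs):
        succ_a = fail_a = succ_b = fail_b = 0
        for _ in range(max_n // peek_every):
            batch_a = rng.random(peek_every) < rate
            batch_b = rng.random(peek_every) < rate
            succ_a += batch_a.sum(); fail_a += peek_every - batch_a.sum()
            succ_b += batch_b.sum(); fail_b += peek_every - batch_b.sum()
            post_a = rng.beta(1 + succ_a, 1 + fail_a, draws)
            post_b = rng.beta(1 + succ_b, 1 + fail_b, draws)
            if np.mean(post_b > post_a) > threshold:
                hits += 1
                break  # a winner is declared at the first crossing
    return hits / runs

# Sweep a few candidate cutoffs; pick the smallest one whose error rate is acceptable.
for cutoff in (0.95, 0.99, 0.999):
    print(f"cutoff {cutoff}: simulated false positive rate ≈ {false_positive_rate(cutoff):.1%}")
```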