Exploring Thompson Sampling for model selection (sourcepilot.co)

🤖 AI Summary
SourcePilot announced an LLM arbitrage system that automatically learns which model to use per user by framing the problem as a multi-armed bandit and running Thompson Sampling in production. Rather than hard-coding preferences or using fixed exploration rates, the system models each candidate LLM with a Beta distribution (α = accepts, β = denies), samples a potential success rate from each model's Beta on every request, and routes to the model with the highest sample. Cold starts fall back to random exploration until enough interactions exist, and an optional small epsilon-exploration term can be layered on top. The result is a Bayesian, adaptive, stochastic policy that naturally balances exploration and exploitation and is provably efficient, with logarithmic regret.

Technically, SourcePilot implements Beta sampling by drawing two Gamma variates with the Marsaglia & Tsang method and forming X/(X+Y), using a Box–Muller normal sampler inside the Gamma routine; unobserved models get a uniform Beta(1,1) prior by adding +1 to each count. Per-user stats (accepts, denies, retries, average response time, success rate) are persisted and updated on feedback, with response time tracked as an exponential moving average.

Compared to random selection, epsilon-greedy, and UCB, Thompson Sampling better handles uncertainty, gives untested models a fair chance, and produces probability-matching behavior, making it a practical, low-friction solution for personalized model selection across cloud and local LLMs.
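To make the selection policy concrete, here is a minimal TypeScript sketch. The names (ModelStats, selectModel) and the cold-start threshold and epsilon values are illustrative assumptions, not SourcePilot's actual code. The Beta sampler here exploits the fact that the shape parameters are integer counts, so Gamma(n, 1) reduces to a sum of n Exp(1) variates; the post's more general sampler follows after.

```typescript
// Hypothetical per-model counters; SourcePilot's real schema is not shown in the post.
interface ModelStats {
  accepts: number; // observed positive feedback for this model
  denies: number;  // observed negative feedback for this model
}

// Gamma(n, 1) for integer n is a sum of n Exp(1) variates.
function sampleGammaInt(shape: number): number {
  let sum = 0;
  for (let i = 0; i < shape; i++) {
    sum += -Math.log(1 - Math.random()); // Exp(1) via inverse CDF; 1 - u avoids log(0)
  }
  return sum;
}

// Beta(a, b) = X / (X + Y) with X ~ Gamma(a), Y ~ Gamma(b).
function sampleBetaInt(a: number, b: number): number {
  const x = sampleGammaInt(a);
  const y = sampleGammaInt(b);
  return x / (x + y);
}

const MIN_OBSERVATIONS = 5; // assumed cold-start threshold
const EPSILON = 0.05;       // assumed optional exploration rate

// Thompson Sampling over the candidate models; assumes `stats` is non-empty.
function selectModel(stats: Map<string, ModelStats>): string {
  const models = [...stats.keys()];
  const pickRandom = () => models[Math.floor(Math.random() * models.length)];

  // Cold start: explore randomly until enough interactions exist.
  const total = [...stats.values()].reduce((s, m) => s + m.accepts + m.denies, 0);
  if (total < MIN_OBSERVATIONS) return pickRandom();

  // Optional small epsilon exploration on top of Thompson Sampling.
  if (Math.random() < EPSILON) return pickRandom();

  // Draw from each model's Beta(accepts + 1, denies + 1) posterior
  // (the +1 is the uniform Beta(1,1) prior) and pick the argmax.
  let best = models[0];
  let bestSample = -Infinity;
  for (const [name, s] of stats) {
    const sample = sampleBetaInt(s.accepts + 1, s.denies + 1);
    if (sample > bestSample) {
      bestSample = sample;
      best = name;
    }
  }
  return best;
}
```

Because each request draws a fresh sample from every posterior, models with few observations have wide Betas and occasionally win the argmax, which is exactly the probability-matching exploration behavior the summary describes.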
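The sampling machinery the post names is more general, handling any positive shape parameter: a Box–Muller normal sampler feeding Marsaglia & Tsang's Gamma method, with Beta(a, b) formed as X/(X+Y). A sketch of that pipeline, assuming the standard formulations of both algorithms:

```typescript
// Standard normal via the Box–Muller transform.
function randNormal(): number {
  const u1 = 1 - Math.random(); // keep u1 in (0, 1] so log(u1) is finite
  const u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Gamma(shape, 1) via Marsaglia & Tsang's (2000) squeeze/rejection method.
function sampleGamma(shape: number): number {
  if (shape < 1) {
    // Boost trick: Gamma(a) = Gamma(a + 1) * U^(1/a) for a < 1.
    return sampleGamma(shape + 1) * Math.pow(Math.random(), 1 / shape);
  }
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number, v: number;
    do {
      x = randNormal();
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    // Fast squeeze check first, then the exact acceptance test.
    if (u < 1 - 0.0331 * x * x * x * x) return d * v;
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
  }
}

// Beta(a, b) from two Gamma draws.
function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a);
  const y = sampleGamma(b);
  return x / (x + y);
}
```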
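Finally, a sketch of the per-user feedback update. The field names and the EMA smoothing factor are assumptions; the summary only states that these stats are persisted, updated on feedback, and that response time uses an exponential moving average.

```typescript
// Hypothetical per-user, per-model record mirroring the stats the post lists.
interface UserModelStats {
  accepts: number;
  denies: number;
  retries: number;
  avgResponseTimeMs: number;
  successRate: number;
}

const EMA_ALPHA = 0.2; // assumed smoothing factor; not stated in the post

function recordFeedback(
  s: UserModelStats,
  accepted: boolean,
  retried: boolean,
  responseTimeMs: number,
): void {
  if (accepted) s.accepts += 1;
  else s.denies += 1;
  if (retried) s.retries += 1;

  // Exponential moving average: new = alpha * sample + (1 - alpha) * old.
  s.avgResponseTimeMs =
    s.avgResponseTimeMs === 0
      ? responseTimeMs // seed the EMA with the first observation
      : EMA_ALPHA * responseTimeMs + (1 - EMA_ALPHA) * s.avgResponseTimeMs;

  // Safe: at least one of accepts/denies was just incremented.
  s.successRate = s.accepts / (s.accepts + s.denies);
}
```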