A/B Tests over Evals (www.raindrop.ai)

🤖 AI Summary
Raindrop pushed back on recent claims from Braintrust that offline "evals" (LLM tests and scorers) are the future of AI product development, arguing instead that A/B testing and production monitoring are indispensable. The author, whose company monitors agents at scale, generating billions of labels monthly and clustering intents to detect issues, says evals are useful as sanity checks and unit tests but insufficient for modern, open-ended agents. They rebut four core claims: that evals will replace testing in production, that evals measure real product quality, that they enable faster iteration, and that they scale to personalization.

Technically, Raindrop emphasizes that true "ground truth" lives in production: agents are stochastic, long-running, and interact with unpredictable environments and tools, so adversarial or static eval suites lag behind real user behavior. Their platform combines semantic signals (tiny, bespoke models that flag behaviors like loops, wrong language, or critical mistakes) with manual signals (thumbs up/down, model switching, regen rates) to run online A/B experiments against baselines. This lets teams safely route small fractions of traffic to new models (e.g., GPT-5) and measure real impact immediately.

The implication for AI/ML teams is clear: retain offline evals for catching regressions, but invest heavily in lightweight production monitoring, signal engineering, and rapid experimentation to catch long-tail failures and personalized behavior that static tests can't predict.
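To make the experiment loop concrete, here is a minimal sketch in Python of the kind of online A/B setup the summary describes: route a small, fixed fraction of sessions to a candidate model and tally manual signals (thumbs up/down, regeneration counts) per arm against the baseline. This is not Raindrop's API; the model names, the 5% split, and helpers like assign_arm and record_session are illustrative assumptions.

```python
import random
from collections import defaultdict
from dataclasses import dataclass

BASELINE = "baseline-model"   # placeholder for the current production model
CANDIDATE = "gpt-5"           # candidate model under test
CANDIDATE_FRACTION = 0.05     # route roughly 5% of sessions to the candidate


def assign_arm(session_id: str) -> str:
    """Deterministically bucket a session so it always sees the same model."""
    rng = random.Random(session_id)  # seeding by session id keeps assignment sticky
    return CANDIDATE if rng.random() < CANDIDATE_FRACTION else BASELINE


@dataclass
class SignalLog:
    """Cheap production signals aggregated per experiment arm."""
    sessions: int = 0
    thumbs_up: int = 0
    thumbs_down: int = 0
    regenerations: int = 0


logs: dict[str, SignalLog] = defaultdict(SignalLog)


def record_session(session_id: str, thumbs_up: bool | None, regens: int) -> None:
    """Log one session's manual signals under whichever arm served it."""
    arm = assign_arm(session_id)
    log = logs[arm]
    log.sessions += 1
    log.regenerations += regens
    if thumbs_up is True:
        log.thumbs_up += 1
    elif thumbs_up is False:
        log.thumbs_down += 1


def report() -> None:
    """Compare arms on regeneration rate and thumbs-up rate."""
    for arm, log in logs.items():
        if log.sessions:
            print(f"{arm}: {log.sessions} sessions, "
                  f"regen rate {log.regenerations / log.sessions:.2f}, "
                  f"thumbs-up rate {log.thumbs_up / log.sessions:.1%}")


# Example usage: a handful of sessions with mixed feedback.
record_session("sess-1", thumbs_up=True, regens=0)
record_session("sess-2", thumbs_up=None, regens=2)
record_session("sess-3", thumbs_up=False, regens=1)
report()
```

Seeding the bucketing by session id keeps assignment sticky, so a user is not flipped between models mid-conversation; a real system would add significance testing and the semantic signals the post describes, which this sketch omits.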