How we helped a YC company (Upsolve) catch a GPT-5 regression (www.arthur.ai)

🤖 AI Summary
Upsolve, a YC-backed analytics platform, partnered with Arthur's Forward Deployed Engineering team to instrument and continuously evaluate its Analysis AI Agent, a multi-step planning agent that answers natural-language data questions by producing a SQL query, a chart, and a final answer. After integrating tracing (token counts, latencies, vector DB retrievals, and model inputs/outputs) and building gold-standard interaction datasets, Arthur's platform ran automated Test Runs that replayed those records against candidate deployments.

When Upsolve upgraded to GPT-5, Arthur's evals immediately flagged a serious regression that would likely have reached customers and might otherwise have gone unnoticed.

Technically, Upsolve scored each of the agent's three outputs with domain-specific binary (0/1) rubrics, implemented as LLM-as-a-judge prompts that codified criteria such as semantic equivalence of SQL. Continuous Evals plus telemetry enabled quantitative before/after comparisons and surfaced regressions at the level of individual agent components.

The system also exposed evals and tuning tools to customers, who could supply training examples and tweak agent behavior, closing the Agent Development Lifecycle (ADLC) loop. The case underscores that observability, reproducible test runs, and domain-aware automated evaluation are essential for safely shipping agentic AI and catching model-upgrade regressions early.
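To make the mechanics concrete, here is a minimal sketch of the kind of per-step tracing described above, using OpenTelemetry spans. The attribute names and the `call_model` client are illustrative assumptions; the article does not show Arthur's actual instrumentation API.

```python
# A minimal sketch of agent-step tracing, assuming an OpenTelemetry setup.
# Attribute names and the call_model client are hypothetical; Arthur's
# actual instrumentation API is not shown in the article.
from opentelemetry import trace

tracer = trace.get_tracer("analysis_agent")

def traced_llm_step(step_name: str, prompt: str, call_model) -> str:
    """Run one LLM call inside a span that records the telemetry the case
    study mentions: model inputs/outputs and token counts. Latency comes
    for free from the span's start/end timestamps."""
    with tracer.start_as_current_span(step_name) as span:
        span.set_attribute("llm.input", prompt)
        # call_model is a hypothetical client returning (text, usage dict)
        output, usage = call_model(prompt)
        span.set_attribute("llm.output", output)
        span.set_attribute("llm.tokens.prompt", usage["prompt_tokens"])
        span.set_attribute("llm.tokens.completion", usage["completion_tokens"])
        return output
```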
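The binary LLM-as-a-judge scoring might look like the sketch below, shown here for the SQL component only. The prompt wording and the judge model are assumptions; the article says only that criteria such as semantic equivalence of SQL were codified in a judge prompt yielding a 0/1 score.

```python
# A minimal LLM-as-a-judge sketch for binary (0/1) SQL scoring.
# The prompt text, model choice, and output convention are illustrative
# assumptions, not Upsolve's actual rubric.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a SQL query produced by an analytics agent.
Question: {question}
Gold SQL: {gold_sql}
Candidate SQL: {candidate_sql}

Return exactly "1" if the candidate is semantically equivalent to the gold
query (it would return the same result set), otherwise return exactly "0"."""

def judge_sql(question: str, gold_sql: str, candidate_sql: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; the article does not name one
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold_sql=gold_sql, candidate_sql=candidate_sql)}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    return 1 if text.startswith("1") else 0
```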
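Finally, a Test Run that replays gold-standard records against a candidate deployment and compares per-component pass rates against a baseline could be sketched as follows. The record schema, the `run_agent` callable, and the regression threshold are hypothetical stand-ins for Arthur's platform.

```python
# A hedged sketch of a Test Run: replay gold-standard records against a
# candidate deployment and compare per-component pass rates to a baseline.
# The record schema, run_agent(), and judges mapping are hypothetical.
from statistics import mean

def test_run(records, run_agent, judges):
    """records: dicts with a 'question' and gold outputs per component.
    run_agent: maps a question to {'sql': ..., 'chart': ..., 'answer': ...}.
    judges: per-component callables returning 0 or 1."""
    scores = {component: [] for component in judges}
    for record in records:
        outputs = run_agent(record["question"])
        for component, judge in judges.items():
            scores[component].append(judge(record, outputs[component]))
    return {component: mean(vals) for component, vals in scores.items()}

def compare(baseline: dict, candidate: dict, threshold: float = 0.05):
    """Flag any component whose pass rate dropped by more than threshold."""
    return {c: (baseline[c], candidate[c])
            for c in baseline
            if baseline[c] - candidate[c] > threshold}

# Usage: regressions = compare(test_run(gold, gpt4_agent, judges),
#                              test_run(gold, gpt5_agent, judges))
```

Binary per-component scores make before/after pass rates directly comparable, which is what lets a model upgrade like GPT-5 surface as a measurable drop in a specific agent component.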