CI/CD for AI: Running Evals on Every Commit (focused.io)

🤖 AI Summary
A developer experimented with making LLM evaluations run on every code change, first via a local pre-commit hook and then by shifting evaluation into CI (GitHub Actions). The pre-commit approach used a Langchain dataset, an evaluator that scored responses on a 0.1–1.0 scale, and a Python script, but it proved impractical: running full evals blocked commits for too long, and simply calling Langchain's asynchronous aevaluate didn't persist runs once the process exited. Increasing max_concurrency helped but wasn't enough for larger datasets, and fire-and-forget subprocess hacks felt fragile. That led to the more realistic solution of running evals in CI.

The post argues that CI-driven continuous evals matter for AI/ML teams because they detect regressions precisely and often, enabling an "Eval-Driven Development" workflow. It also lays out practical guardrails for making evals actionable in CI: curate a stable dataset and evaluator, persist experiment runs (using an experiment prefix), and convert graded, non-binary results into pass/fail signals by (a) setting average thresholds, (b) enforcing per-example minimums to catch outliers, and (c) comparing the current average to recent runs (e.g., the last three) via the Langchain SDK to track regressions over time (see the sketches below). The result is a CI step that both prevents quality drops and guides targeted improvements to prompts and to the handling of uncovered cases.
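A minimal sketch of what the awaited eval step might look like, assuming the langsmith Python SDK (which provides aevaluate, experiment_prefix, and max_concurrency); the dataset name, target function, and evaluator below are hypothetical placeholders, and the import path can vary by SDK version:

```python
# ci_evals.py -- a sketch, not the article's actual script.
# Assumes the langsmith Python SDK; "ci-eval-dataset", answer_question,
# and correctness are hypothetical placeholders.
import asyncio

from langsmith import aevaluate  # import path may differ by SDK version


async def answer_question(inputs: dict) -> dict:
    """Hypothetical target: call the LLM application under test."""
    return {"answer": "..."}


def correctness(run, example) -> dict:
    """Hypothetical evaluator producing a graded score between 0.1 and 1.0."""
    return {"key": "correctness", "score": 0.9}


async def main() -> None:
    # Awaiting the whole evaluation (rather than firing it off and exiting,
    # as in the pre-commit attempt) is what keeps every run persisted.
    await aevaluate(
        answer_question,
        data="ci-eval-dataset",   # the curated, stable dataset
        evaluators=[correctness],
        experiment_prefix="ci",   # makes CI experiments easy to find later
        max_concurrency=8,        # speeds things up on larger datasets
    )


if __name__ == "__main__":
    asyncio.run(main())
```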
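Converting graded scores into a binary CI signal could be a small gate script along these lines; the thresholds are illustrative, not numbers from the article:

```python
# eval_gate.py -- sketch of turning graded scores into a CI pass/fail signal.
# Thresholds are illustrative; the article does not prescribe specific values.
import sys

AVG_THRESHOLD = 0.80        # (a) fail if the average score drops below this
PER_EXAMPLE_MINIMUM = 0.50  # (b) fail on any single outlier below this


def passes(scores: list[float]) -> bool:
    average = sum(scores) / len(scores)
    worst = min(scores)
    print(f"average={average:.3f} worst={worst:.3f}")
    return average >= AVG_THRESHOLD and worst >= PER_EXAMPLE_MINIMUM


if __name__ == "__main__":
    # In CI these scores would come from the experiment run sketched above;
    # accepting them as arguments keeps this sketch self-contained.
    scores = [float(s) for s in sys.argv[1:]]
    sys.exit(0 if passes(scores) else 1)
```

In a GitHub Actions job, a nonzero exit from a step like this is enough to fail the build.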
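For check (c), comparing the current average against recent runs, the shape of the logic might look like this; fetching the past averages would involve listing earlier "ci"-prefixed experiments for the dataset via the SDK, and those calls are omitted here because they vary by SDK version. The tolerance value is an assumption for illustration:

```python
# regression_check.py -- sketch of check (c): compare the current average to
# the mean of the last few CI runs (e.g., the last three).
def regressed(current_avg: float, recent_avgs: list[float], tolerance: float = 0.02) -> bool:
    """Flag a regression when the current average falls more than `tolerance`
    below the mean of the recent runs."""
    if not recent_avgs:
        return False  # first run: nothing to compare against yet
    baseline = sum(recent_avgs) / len(recent_avgs)
    return current_avg < baseline - tolerance


if __name__ == "__main__":
    # Illustrative values only: the last three runs averaged ~0.86, today 0.79.
    print(regressed(0.79, [0.85, 0.87, 0.86]))  # -> True (quality dipped)
```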