Show HN: E2E Testing for Chatbots (github.com)

🤖 AI Summary
SigmaEval is a new Python framework for statistical, end-to-end testing of conversational AI (chatbots, virtual assistants, and other LLM-powered apps) that replaces ad-hoc, subjective checks with hypothesis-driven evaluation. You define human-readable ScenarioTests (Given-When-Then) that set an objective quality bar, e.g. "75% of responses score ≥7/10" or "95% of responses have <1.5s latency". An AI User Simulator then generates many realistic interactions, and an AI Judge scores each conversation on a 1–10 rubric. SigmaEval aggregates those scores and runs one-sided statistical tests (e.g. proportion_gte, median_gte, metrics.proportion_lt) with a configurable sample_size and significance_level (alpha), producing pass/fail results with confidence intervals, like running a clinical trial for your bot.

Technically, SigmaEval uses two LLM agents (simulator + judge), supports 100+ model providers via LiteLLM (examples show Gemini/OpenAI keys), and offers a fluent ScenarioTest API plus stateful app_handler callbacks for integrating real apps. It applies bootstrap tests for medians, binomial-style proportion tests for quality, and metric-based assertions for latency and turn counts, enabling SLO-style guarantees and integration with existing test suites. The result: reproducible, quantitative quality claims and actionable thresholds for deploying and iterating on LLM features.
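To make the workflow concrete, here is a minimal sketch of how the pieces described above might fit together. The class and method names (ScenarioTest, proportion_gte, app_handler, sample_size, significance_level) are taken from the summary, but the exact signatures, argument names, and the placeholder my_chatbot helper are assumptions and may not match the real SigmaEval API; check the repository for the authoritative usage.

```python
# Illustrative sketch only: names and signatures are inferred from the summary
# and may differ from the actual SigmaEval API.
from sigmaeval import SigmaEval, ScenarioTest, assertions  # assumed imports

# Given-When-Then scenario with an objective quality bar:
# "at least 75% of judged conversations score >= 7/10 on the rubric".
test = (
    ScenarioTest("Refund policy questions")
    .given("A customer who bought a product 10 days ago")
    .when("They ask the chatbot how to get a refund")
    .expect(
        "The bot explains the refund steps clearly and accurately",
        assertions.scores.proportion_gte(min_score=7, proportion=0.75),
    )
)

# Stateful app_handler callback: routes each simulated user turn into the
# real application under test and returns the bot's reply.
def app_handler(message: str, session_state: dict) -> str:
    # my_chatbot is a stand-in for your own app (HTTP endpoint, SDK, etc.).
    return my_chatbot.reply(message, state=session_state)

# The evaluator drives the AI User Simulator, has the AI Judge score each
# conversation on a 1-10 rubric, and runs a one-sided test at the chosen alpha.
evaluator = SigmaEval(
    judge_model="gemini/gemini-2.0-flash",  # any LiteLLM-supported model id
    sample_size=30,
    significance_level=0.05,
)
results = evaluator.evaluate(test, app_handler)
print(results)  # pass/fail plus confidence interval for the quality claim
```

The appeal of this shape is that the quality bar lives in the test definition itself, so a CI run can fail a deploy when the statistical claim ("75% of responses score ≥7/10 at alpha = 0.05") no longer holds.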