🤖 AI Summary
ev is an open-source, lightweight CLI for locally evaluating and iteratively optimizing LLM prompts and agent behavior. It runs JSON test cases against a pair of Jinja2 prompt templates (system_prompt.j2 + user_prompt.j2), scores the outputs against multi-criteria rules declared in eval.md, and automatically proposes and tests improved prompt versions, accepting a new snapshot only when it demonstrably outperforms the active version. The tool is designed for reproducible, deterministic hardening of prompts (repeatable cycles expose edge-case flakiness), supports OpenAI and Groq providers via provider[name] model notation, requires Python ≥3.12, and installs with pip install evx.
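To make the optimization loop concrete, here is a minimal Python sketch of the accept-only-if-better behavior described above. The `evaluate` and `propose` callables and the toy stand-ins in the usage example are hypothetical placeholders, not ev's actual API.

```python
"""Hypothetical sketch: keep a candidate prompt only if it beats the active version."""
from typing import Callable

Prompt = dict[str, str]  # e.g. {"system": "...", "user": "..."} rendered from the .j2 templates

def optimize(
    active: Prompt,
    cases: list[dict],
    iterations: int,
    evaluate: Callable[[Prompt, list[dict]], float],  # returns a pass rate in [0, 1]
    propose: Callable[[Prompt], Prompt],              # proposes an improved prompt pair
) -> tuple[Prompt, float]:
    """Accept a candidate only when it strictly outperforms the active version's pass rate."""
    best_score = evaluate(active, cases)
    for _ in range(iterations):
        candidate = propose(active)
        score = evaluate(candidate, cases)
        if score > best_score:  # strict improvement required before saving a new snapshot
            active, best_score = candidate, score
    return active, best_score

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real run would call an LLM provider.
    cases = [{"input": "example"}]
    evaluate = lambda prompt, _cases: min(1.0, len(prompt["system"]) / 100)  # dummy scoring
    propose = lambda prompt: {**prompt, "system": prompt["system"] + " Be concise."}
    best, score = optimize(
        {"system": "You are a helpful assistant.", "user": "{{ input }}"},
        cases, iterations=3, evaluate=evaluate, propose=propose,
    )
    print(best, score)
```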
Technically, each test lives under evals/<test>/ with cases/ (JSON inputs), eval.md (criteria headings, each treated as an independent criterion), schema.py (a Pydantic response schema), and the prompt templates. Runs are organized into cycles (repeated evaluations to reduce randomness) and iterations (generate candidate prompts, then re-evaluate), and a pass-rate metric averages scores across criteria so no single criterion dominates. Total model calls per run = cases × cycles × iterations. CLI actions include ev create, run, eval, list, copy, delete, and version; configuration reads API keys from .env or environment variables. Output artifacts (summary.json, versions/, log.json) support CI/dashboard integration and keep version history clean by ensuring each saved version is a strict improvement.
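As a rough illustration of this bookkeeping, the sketch below averages per-criterion scores into a pass rate and computes total model calls per run; the criterion names and scores are invented for the example and do not come from ev itself.

```python
from statistics import mean

def pass_rate(criterion_scores: dict[str, float]) -> float:
    """Average across criteria so no single criterion dominates the overall score."""
    return mean(criterion_scores.values())

def total_model_calls(num_cases: int, cycles: int, iterations: int) -> int:
    """Total model calls per run = cases x cycles x iterations."""
    return num_cases * cycles * iterations

# Hypothetical criteria parsed from eval.md headings, with per-criterion scores.
scores = {"Correctness": 1.0, "Tone": 0.5, "JSON validity": 1.0}
print(pass_rate(scores))                                         # ~0.833
print(total_model_calls(num_cases=10, cycles=3, iterations=2))   # 60
```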