🤖 AI Summary
Google’s Angular team released Web Codegen Scorer, a CLI tool and reporting UI to evaluate the quality of web code generated by LLMs. Unlike broad code benchmarks, it focuses specifically on web apps and uses established quality measures so teams can make evidence-based decisions: iterate on system prompts, compare models, and monitor generated-code quality as models and agents evolve. It’s framework-agnostic (works with Angular or any web library) and intended to replace ad-hoc trial-and-error with repeatable, consistent assessments.
Technically, the scorer (installable via npm) runs end-to-end evals against OpenAI, Anthropic, and Gemini models, with API keys supplied as environment variables, and supports configurable environments, runners (genkit or gemini-cli), concurrency, sampling limits, RAG endpoints, and local re-runs that avoid fresh LLM costs. Built-in checks cover build success, runtime errors, accessibility, security, LLM rating, and coding best practices, plus automatic repair attempts and a report viewer for side-by-side comparisons. CLI flags (--model, --env, --local, --limit, --concurrency, --skip-screenshots, etc.) let teams tailor runs. Roadmap items include interaction testing, Core Web Vitals measurement, and assessing LLM-driven edits, making this a practical tool for ML engineers and researchers to quantify, compare, and improve code generation in production workflows.
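For illustration, a minimal sketch of what a run might look like. The binary name, the `eval` subcommand, and the API-key variable name are assumptions not confirmed by the summary; the flags themselves (--env, --model, --limit, --concurrency, --local) are the ones listed above.

```bash
# Assumed variable name; the summary only says API keys are provided as env vars.
export GEMINI_API_KEY="your-key-here"

# Hypothetical invocation: binary name and subcommand are assumptions,
# the flags are the ones the summary lists.
npx web-codegen-scorer eval \
  --env=path/to/your-env-config \
  --model=gemini-2.5-pro \
  --limit=20 \
  --concurrency=4

# Re-open saved results locally without new LLM calls (the summary's --local flag).
npx web-codegen-scorer eval --local
```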