Building Repo Bench (repoprompt.com)

🤖 AI Summary
Repo Bench is a new, community-driven benchmark and problem suite designed to measure how well large language models (LLMs) can make precise, multi-file code edits in realistic, noisy repository contexts. Born from the author's experience using Opus/Sonnet models to build an entire Apple Vision Pro game, the bench codifies the hard practical failure modes of model-based code editing: finding unique search blocks, avoiding collateral edits, handling duplicate or decoy files, preserving formatting (e.g., balanced braces), and working within expensive token budgets. It pairs a robust apply_edits workflow (inspired by Aider's diff-edit approach) with structured prompting so models output parsable edit descriptions rather than entire rewritten files, improving both token efficiency and edit reliability.

Technically, Repo Bench generates deterministic, seed-based problem variants (10 problem templates × easy/medium/hard) that scale difficulty by file size, number of decoys, and edit count, ensuring test instances aren't memorized. Evaluation accounts for LLM nondeterminism by filtering out statistical outliers (IQR and >2σ), discarding responses scoring more than 15% below the top, and then taking a median that favors the higher value, which yields a stable leaderboard.

The benchmark doesn't claim to measure "intelligence" or optimal code quality; instead it quantifies a model's adaptability to strict output formats and its instruction fidelity in large, ambiguous contexts, the skills central to reliable coding assistants and long-running agent workflows.
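
To make the search/replace style of editing concrete, here is a minimal sketch of an edit applier in the spirit of the apply_edits workflow described above. The EditBlock schema and function names are illustrative assumptions, not Repo Prompt's actual format; the key idea is that an edit is rejected unless its search block matches exactly once, which is how collateral edits are avoided.

```python
from dataclasses import dataclass

@dataclass
class EditBlock:
    path: str      # file the edit targets
    search: str    # exact text that must appear exactly once in the file
    replace: str   # text that takes its place

def apply_edit(file_text: str, edit: EditBlock) -> str:
    """Apply one edit, refusing missing or ambiguous search blocks."""
    count = file_text.count(edit.search)
    if count == 0:
        raise ValueError(f"search block not found in {edit.path}")
    if count > 1:
        # An ambiguous match risks editing the wrong occurrence elsewhere
        # in the file, one of the failure modes the benchmark surfaces.
        raise ValueError(f"search block matches {count} times in {edit.path}")
    return file_text.replace(edit.search, edit.replace, 1)
```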
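
The deterministic, seed-based variant generation could look roughly like the sketch below: the same template, difficulty, and seed always produce the same instance, while the difficulty tier scales file size, decoy count, and edit count. The parameter ranges and names here are assumptions for illustration, not the bench's actual values.

```python
import hashlib
import random

DIFFICULTY = {
    # (target file size in lines, decoy files, edits required) — illustrative ranges
    "easy":   dict(lines=(100, 300),  decoys=(0, 1), edits=(1, 2)),
    "medium": dict(lines=(300, 800),  decoys=(1, 3), edits=(2, 4)),
    "hard":   dict(lines=(800, 2000), decoys=(3, 6), edits=(4, 8)),
}

def _derive_seed(template_id: int, difficulty: str, seed: int) -> int:
    """Hash the inputs into a stable integer seed (reproducible across runs)."""
    key = f"{template_id}:{difficulty}:{seed}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def make_variant(template_id: int, difficulty: str, seed: int) -> dict:
    """Derive one problem instance; identical inputs always yield the same variant."""
    rng = random.Random(_derive_seed(template_id, difficulty, seed))
    spec = DIFFICULTY[difficulty]
    return {
        "template": template_id,
        "difficulty": difficulty,
        "file_lines": rng.randint(*spec["lines"]),
        "decoy_files": rng.randint(*spec["decoys"]),
        "edit_count": rng.randint(*spec["edits"]),
        "seed": seed,
    }
```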
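
Finally, a minimal sketch of the score-aggregation rule summarized above: drop IQR and >2σ outliers, discard runs more than 15% below the best remaining run, then report a median that breaks toward the higher value. The thresholds follow the summary; the helper structure is an assumption.

```python
import statistics

def aggregate_runs(scores: list[float]) -> float:
    """Collapse repeated runs of one model into a single stable score."""
    if not scores:
        raise ValueError("no scores to aggregate")
    s = sorted(scores)

    # 1) Drop IQR outliers (outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]).
    if len(s) >= 4:
        q1, _, q3 = statistics.quantiles(s, n=4)
        iqr = q3 - q1
        s = [x for x in s if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]

    # 2) Drop scores more than 2 standard deviations from the mean.
    if len(s) > 1:
        mu, sigma = statistics.mean(s), statistics.stdev(s)
        s = [x for x in s if abs(x - mu) <= 2 * sigma]

    # 3) Discard runs more than 15% below the top remaining score.
    top = max(s)
    s = sorted(x for x in s if x >= 0.85 * top)

    # 4) Median, taking the upper of the two middle values when the count is even.
    return s[len(s) // 2]
```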