Claude.md: Best Practices for Optimizing with Prompt Learning (arize.com)

🤖 AI Summary
Researchers applied a Prompt Learning loop to Claude Code (powered by Claude Sonnet 4.5) by optimizing its system prompt—via CLAUDE.md or the CLI --append-system-prompt—to improve code-generation performance without changing models or tooling. The loop uses two train/test splits (by-repo to test generalization, and within-repo to mimic a developer workflow), runs Claude Code on training issues to produce patches, executes unit tests for binary pass/fail signals, then asks an LLM to produce rich, diagnostic feedback on each patch. A meta-prompting stage consumes those LLM evals to generate revised system-prompt "rules," which are re-applied and iterated until performance plateaus or costs limit further tuning.

The results are notable: a +5.19% absolute test-accuracy gain on the by-repo split and +10.87% on the within-repo split, demonstrating both general coding improvements and strong repo-specific specialization.

Key technical takeaways: LLM evals (explanations of failure modes) yield much richer signals than scalar rewards, and meta-prompting effectively optimizes system prompts rather than model weights. Practically, this shows developers can meaningfully boost top-tier coding agents by curating instructions (CLAUDE.md) that encode repo conventions, testing practices, and pitfalls—an approach scalable to any agent that exposes a system prompt.
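To make the loop concrete, here is a minimal Python sketch of the iteration described above. It is not the article's implementation: the helper names (run_unit_tests, llm_eval, meta_prompt), the data shapes, and the use of subprocess to drive the Claude Code CLI are all illustrative assumptions; only the --append-system-prompt flag is taken from the source.

```python
"""Hedged sketch of a Prompt Learning loop for Claude Code.

Assumptions: run_unit_tests, llm_eval, and meta_prompt are hypothetical
stand-ins for the harness, the LLM evaluator, and the meta-prompting
stage; only the `--append-system-prompt` CLI flag comes from the article.
"""
import subprocess


def run_claude_code(issue: str, rules: str) -> str:
    """Run Claude Code on one training issue, appending the current
    optimized rules to its system prompt, and return the generated patch."""
    result = subprocess.run(
        ["claude", "-p", issue, "--append-system-prompt", rules],
        capture_output=True, text=True,
    )
    return result.stdout


def run_unit_tests(patch: str) -> bool:
    """Hypothetical: apply the patch and run the repo's unit tests,
    yielding the binary pass/fail signal."""
    raise NotImplementedError


def llm_eval(issue: str, patch: str, passed: bool) -> str:
    """Hypothetical: ask an LLM for diagnostic feedback on the patch
    (failure modes, violated repo conventions), not just a scalar reward."""
    raise NotImplementedError


def meta_prompt(rules: str, feedback: list[str]) -> str:
    """Hypothetical: feed the accumulated evals to a meta-prompting LLM
    that rewrites the system-prompt rules (e.g. CLAUDE.md content)."""
    raise NotImplementedError


def prompt_learning_loop(train_issues: list[str], max_iters: int = 10) -> str:
    rules = ""        # start from an empty CLAUDE.md / appended prompt
    prev_acc = -1.0   # training accuracy of the previous iteration
    for _ in range(max_iters):
        feedback, passes = [], 0
        for issue in train_issues:
            patch = run_claude_code(issue, rules)
            passed = run_unit_tests(patch)
            passes += int(passed)
            feedback.append(llm_eval(issue, patch, passed))
        acc = passes / len(train_issues)
        if acc <= prev_acc:   # plateau (or cost budget) -> stop iterating
            break
        prev_acc = acc
        rules = meta_prompt(rules, feedback)  # revised rules for next round
    return rules
```

The returned rules can then be committed to the repo's CLAUDE.md (or passed via --append-system-prompt) so future Claude Code runs pick up the learned conventions without any further tuning.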