Claude Can (Sometimes) Prove It (www.galois.com)

🤖 AI Summary
Anthropic’s Claude Code — an AI coding agent that breaks large tasks into substeps and runs CLI tools — can now meaningfully assist with interactive theorem proving in Lean. In a test formalizing the Deny‑Guarantee Reasoning paper, the agent wrote and iterated on proofs using an automated loop (run lake build, read errors, hypothesize fixes, modify code) and reached about 50% of the planned formalization (2,535 lines of Lean, ~1,232 lines of proof). It handled not only individual theorems but higher‑level “proof engineering” tasks: choosing abstractions, refactoring (e.g., swapping types for finite maps), searching code, and using tools like lean‑mcp‑lsp to query proof state. The agent’s workflow is cloud‑backed and interactive, with user confirmations for commands to mitigate risks. This matters because it points to a future where ITP is less the preserve of few expert “proof grinders” and more accessible via AI project managers that can orchestrate multi‑step development. Technically, Lean’s strict, machine‑checkable failures make it well suited to iterative agents that can diagnose and repair code; providing richer machine‑readable diagnostics and agent‑friendly tooling improves results. Caveats are significant: the agent is slower than an expert, sometimes pursues blind alleys or embeds persistent conceptual errors that are costly to undo, and many definitions still need human auditing. In short, Claude Code is a promising step toward automating proof engineering, but not yet a substitute for expert oversight.
Loading comments...
loading comments...