System Eval with Obsidian and Claude Code (interjectedfuture.com)

🤖 AI Summary
A developer experimented with a lightweight system-evaluation stack built from Obsidian (markdown files as a knowledge base) and Claude Code (an agent that can read files and generate and execute code). Instead of building heavy infrastructure, they represent the entire eval as a single Obsidian markdown document divided into sections; each section contains a prompt (or pseudocode), the command line to run the resulting script, and a link to the output.

Claude Code reads the prompt from disk, one-shots simple scripts (e.g., converting queries into traces), and produces scripts you can run directly, such as `python3 scripts/convert_discord_json_to_md.py sources/lesson_1 queries`. Traces and annotations live as markdown files whose frontmatter holds the judgement and annotation, enabling error analysis and iteration without a database or bespoke UI.

The significance for the AI/ML community is practical: it's a reproducible, shareable, git-native eval workflow that leverages the “narrow waist” of plain text files so an agent can access full context on disk. That matters because many AI failures are context failures rather than capability limits; giving an agent direct access to a curated knowledge base improves fidelity and repeatability. The pattern shows how a simple combination of tools can enable end-to-end system evals, lightweight error analysis, and easier collaboration, trading complex orchestration for disciplined markdown plus agent-driven scripting.
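To make the query-to-trace step concrete, here is a rough sketch of what such a one-shotted converter might look like. This is a guess, not the author's code: the post shows only the command line, so the Discord export schema, the frontmatter field names, and the one-markdown-file-per-query layout are all assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a convert_discord_json_to_md.py-style script.

Assumes a Discord-export-like JSON shape ({"messages": [{"author": ...,
"content": ...}]}); the real script's input schema is not shown in the post.
"""
import json
import sys
from pathlib import Path


def main(source_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(source_dir).glob("*.json")):
        data = json.loads(src.read_text())
        for i, msg in enumerate(data.get("messages", [])):
            # Emit one markdown trace per query, with frontmatter fields
            # that a later annotation pass (human or agent) can fill in.
            body = (
                "---\n"
                f"source: {src.name}\n"
                "judgement: pending\n"
                'annotation: ""\n'
                "---\n\n"
                f"**{msg.get('author', 'unknown')}**: {msg.get('content', '')}\n"
            )
            (out / f"{src.stem}_{i:03d}.md").write_text(body)


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Invoked as in the summary's example, `python3 scripts/convert_discord_json_to_md.py sources/lesson_1 queries` would then populate queries/ with one annotation-ready trace per message.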
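Because judgements live in frontmatter rather than a database, error analysis reduces to scanning files. A minimal stdlib-only sketch, again assuming the hypothetical `judgement` field and the `queries/` output directory from the command above:

```python
import collections
from pathlib import Path


def read_frontmatter(path: Path) -> dict:
    """Parse a markdown file's frontmatter block (between leading '---'
    delimiters) as flat key: value pairs; enough for simple string fields."""
    lines = path.read_text().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return meta


# Tally judgements across all traces: a one-file-per-trace "database"
# that stays grep-able, diff-able, and git-native.
counts = collections.Counter(
    read_frontmatter(p).get("judgement", "unlabelled")
    for p in Path("queries").glob("*.md")
)
print(dict(counts))
```

Anything fancier (filtering failures, grouping by error category) is a few more lines of the same pattern, which is the trade the post argues for: disciplined plain text over bespoke eval infrastructure.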