🤖 AI Summary
In about 20 minutes, a single LLM session plus an “agentic” CLI turned a messy 1,000+‑row, user‑generated table into a tidy, query‑ready dataset. The author used Qwen Code (noting that OpenAI Codex or Gemini CLI would work just as well) to run shell commands and a psql wrapper defined in AGENTS.md, letting the LLM inspect the schema, count distinct values, propose standardized mappings, generate and execute UPDATE statements, and verify row counts. Typical fixes included normalizing variants like “Marvel 616” and “Marvel, Earth‑616” to “Marvel Comics (Earth‑616)” and consolidating DC naming. To scale the cleanup, the LLM produced a universe_standardization.csv, which was reviewed and then converted into a batched loop of UPDATEs. For moderation, a migration added an approved column; the LLM iteratively scanned rows in 50‑row batches, flagged slurs and explicit content, and marked offending entries as unapproved. Safety controls included make backup-postgres and restore-checkpoint commands plus human approval before risky actions.
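The article doesn't reproduce the exact statements, but the CSV‑driven pass can be pictured as a small shell loop around psql. Everything below is a hedged sketch: the characters table, the universe column, the two‑column CSV layout, and the DATABASE_URL variable are assumptions for illustration, not details from the post.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the batched standardization pass: read a reviewed
# mapping file and rewrite each variant to its canonical value via psql.
# Assumes a headerless two-column CSV (original,standardized) with no
# embedded commas, and a characters(universe) column; both names are guesses.
set -euo pipefail

DB_URL="${DATABASE_URL:?set DATABASE_URL first}"

while IFS=, read -r original standardized; do
  # psql -v variables are interpolated as quoted literals via :'name',
  # which avoids hand-rolled SQL string escaping.
  psql "$DB_URL" -v orig="$original" -v std="$standardized" <<'SQL'
UPDATE characters
SET universe = :'std'
WHERE universe = :'orig';
SQL
done < universe_standardization.csv

# Verify the way the post describes: recount distinct values afterwards.
psql "$DB_URL" -c "SELECT universe, COUNT(*) AS n FROM characters GROUP BY universe ORDER BY n DESC;"
```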
This workflow highlights a pragmatic alternative to building bespoke agent toolchains: agentic CLIs plus a shell tool dramatically reduce development overhead while keeping a human in the loop. Key technical takeaways: expose a limited set of safe tools (psql wrapper, backups), minimize LLM decision points for determinism, use CSVs for batch standardization, and include explicit rollback and whitelisting for higher‑risk operations. The approach is fast and practical for non‑critical cleanup, but demands strong safety controls when applied to production data.
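As a purely illustrative example of that first takeaway, a constrained wrapper might let read‑only statements through, force a backup and a human confirmation before writes, and refuse everything else. The post defines its own wrapper in AGENTS.md and doesn't show its contents, so the shape below is an assumption; only the make backup-postgres and restore-checkpoint targets come from the summary above.

```bash
#!/usr/bin/env bash
# Illustrative psql wrapper exposing a whitelist of "safe" operations to the
# agent: read-only statements pass through, writes require a fresh backup and
# an explicit human yes, anything else is rejected. Shape and names assumed.
set -euo pipefail

DB_URL="${DATABASE_URL:?set DATABASE_URL first}"
STMT="$1"

case "$STMT" in
  SELECT*|select*|EXPLAIN*|explain*|\\d*)
    # Inspection queries: schema dumps, counts, distinct-value checks.
    psql "$DB_URL" -c "$STMT"
    ;;
  UPDATE*|update*|INSERT*|insert*|ALTER*|alter*)
    # Writes: snapshot first, then keep the human in the loop.
    make backup-postgres
    read -rp "Run write statement? [y/N] " ok
    if [ "$ok" = "y" ]; then
      psql "$DB_URL" -c "$STMT"
    fi
    # If the change turns out to be wrong, the post's make restore-checkpoint
    # target provides the rollback path.
    ;;
  *)
    echo "Refused: statement is outside the whitelist." >&2
    exit 1
    ;;
esac
```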