ai.robots.txt – A list of AI agents and robots to block (github.com)

🤖 AI Summary
A community-maintained GitHub project, ai.robots.txt, publishes a curated list of AI/ML crawlers along with ready-to-deploy blocking rules so site owners can opt out of automated scraping. The repo provides a canonical robots.txt (following the Robots Exclusion Protocol, RFC 9309), plus server-ready snippets for Apache (.htaccess), Nginx (an includable conf), Caddy (a header-regex matcher with an abort handler), HAProxy (an ACL with an http-request deny example), and a Traefik middleware that serves the rules on the fly.

Contributors add entries to robots.json, and a GitHub Action regenerates robots.txt, .htaccess, the Nginx config, and metric tables. The project also links to testing scripts (a Python tests.py), an RSS release feed, and sourcing help from projects like Dark Visitors. There is support for reporting noncompliant crawlers through Cloudflare, and an option to license site content to AI firms using the Really Simple Licensing (RSL) standard, with a WordPress plugin that adds RSL support plus payment processing.

For the AI/ML community this centralizes practical defenses and governance tools against indiscriminate data scraping, useful for researchers, commercial sites, and publishers wanting to control training-data exposure. Technically it lowers the friction for operators across popular stacks to reject or identify AI bots, but it also highlights the limits: robots.txt is advisory and some crawlers ignore it, so pairing it with server-level blocks or third-party enforcement (e.g., Cloudflare hard blocks) remains important. The project's automated workflow and multi-server snippets make it a useful operational resource in the evolving arms race over web-sourced training data.
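To make the "robots.json in, blocking rules out" flow concrete, here is a minimal sketch under stated assumptions. It is not the project's actual generator (whose GitHub Action and robots.json schema are richer); the function names, file names, and simplified JSON entries below are illustrative only. It reads a mapping of crawler names to metadata and emits a robots.txt group plus an Nginx if-block that returns 403 for matching User-Agents.

```python
# Minimal sketch, not the project's actual generator: read a simplified
# robots.json-style mapping of crawler names to metadata and emit (1) a
# robots.txt group that disallows all listed bots and (2) an Nginx snippet
# that returns 403 when the User-Agent matches any of them.
import json
from pathlib import Path


def generate_robots_txt(agents: dict) -> str:
    """One User-agent line per listed crawler, followed by a blanket Disallow."""
    lines = [f"User-agent: {name}" for name in sorted(agents)]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def generate_nginx_snippet(agents: dict) -> str:
    """Case-insensitive User-Agent match that rejects listed crawlers with 403."""
    pattern = "|".join(sorted(agents))
    return f'if ($http_user_agent ~* "({pattern})") {{\n    return 403;\n}}\n'


if __name__ == "__main__":
    # Hypothetical entries; the real robots.json carries richer per-bot metadata
    # (operator, function, whether the crawler respects robots.txt, etc.).
    agents = json.loads('{"GPTBot": {}, "CCBot": {}, "ClaudeBot": {}}')
    Path("robots.txt").write_text(generate_robots_txt(agents))
    Path("ai-bots.nginx.conf").write_text(generate_nginx_snippet(agents))
```

Because robots.txt is only advisory, it is the generated server-level rule (or a third-party hard block) that actually turns away crawlers that ignore it.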