Goofing on Meta's AI Crawler (bruceediger.com)

🤖 AI Summary
A hobbyist blogger discovered in March 2025 that Meta's crawler (user agent meta-externalagent/1.1) was aggressively scraping their site, eventually peaking at ~270,000 URL requests/day. To probe it, they added an Apache rewrite rule routing requests with that user agent to a PHP script, "bork.php", which generated an "infinite" stream of randomized HTML and links with a mean response delay of ~14 s (a hedged sketch of both pieces appears below). Over the experiment the site served ~8.9M 200 OK responses (Mar–Jun) and then, after switching to 404s, logged ~6.2M 404s (Jun–Nov).

Meta crawled from hundreds of IPv6 addresses in 2a03:2880::/29 plus a handful of IPv4 ranges, with IPv6 accounting for ~15.1M requests versus ~7.9k over IPv4, a strong IPv6 preference. The crawler's behavior was technically revealing: it overwhelmingly requested text-like suffixes (.html, .htm, and similar, ~87% of requests), largely ignored image and torrent-style URLs, yet oddly followed .mp3 links. That pattern aligns with LLM training priorities (text over media) and raises copyright and load-cost concerns for small hosts.

The experiment demonstrates practical defenses (scraper honeypots, targeted 404/503 responses, per-host rate limiting) and highlights the ethical and legal friction of large companies harvesting web content for training models. The author argues for individualized, idiosyncratic "junkyards" as a decentralized countermeasure, but cautions about hosting costs and legal ambiguity.
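The summary doesn't reproduce the author's actual code, but the mechanism is simple enough to sketch. First, a minimal Apache fragment that routes only the Meta user agent into the trap; the path pattern and script name are illustrative, and the commented-out rules show how the later 404 phase could be wired up:

```apache
# Hypothetical .htaccess sketch (assumes mod_rewrite is enabled). The UA
# substring comes from the summary; everything else is illustrative.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC]
RewriteRule \.html?$ /bork.php [L]

# Later phase of the experiment: answer the same crawler with 404s instead.
# RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC]
# RewriteRule ^ - [R=404,L]
```

And a toy bork.php in the same spirit: every hit returns random text plus links back into the trap, after a randomized delay. The delay bounds, word counts, and link counts are assumptions chosen to match the ~14 s mean quoted above, not the author's actual values:

```php
<?php
// Hypothetical honeypot sketch. Each response is junk text plus links to
// more randomized .html URLs, which the rewrite rule above routes straight
// back here, so the crawl never terminates.

sleep(random_int(7, 21));            // randomized delay, ~14 s on average

header('Content-Type: text/html');

// Random lowercase "word" of 3-10 letters.
function random_word(): string {
    $w = '';
    for ($i = 0, $len = random_int(3, 10); $i < $len; $i++) {
        $w .= chr(random_int(97, 122));
    }
    return $w;
}

echo "<html><head><title>" . random_word() . "</title></head><body>\n";

// A paragraph of junk text for the crawler to ingest.
echo "<p>";
for ($i = 0; $i < 200; $i++) {
    echo random_word() . ' ';
}
echo "</p>\n";

// Links back into the trap via fresh randomized .html URLs.
for ($i = 0; $i < 10; $i++) {
    $slug = random_word() . '-' . random_word();
    echo "<p><a href=\"/" . $slug . ".html\">" . $slug . "</a></p>\n";
}

echo "</body></html>\n";
```

The slow, cheap-to-generate pages are the point: each junk response ties up the crawler's connection for many seconds while costing the host almost nothing, which is the economics the honeypot relies on.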