🤖 AI Summary
A week‑long experiment planted an “infinite nonsense crawler trap” and found that modern scrapers—bots harvesting web content to train LLMs—quickly came to dominate server traffic. These aren’t polite search crawlers: they ignore robots.txt, spoof browser user agents, rotate through thousands of IPs (sometimes a new one per request), and hammer sites with multiple requests per second. Even serving static pages and images gets costly at that rate: at ~100 kB per file, four requests per second works out to roughly 1 TB of traffic per month. Common defenses—IP blocks, rate limits, paywalls, CAPTCHAs, gzip bombs, returning 404s—either fail outright or provoke more aggressive behavior.
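A quick back-of-the-envelope check of that bandwidth figure, using only the approximate numbers quoted above (the ~100 kB average response size and four requests per second are the summary's assumptions, not measured values):

```python
# Back-of-the-envelope bandwidth cost of sustained scraper traffic.
# Assumed inputs from the summary: ~100 kB per response, ~4 requests/sec.
AVG_RESPONSE_BYTES = 100 * 1000      # ~100 kB per static page or image
REQUESTS_PER_SEC = 4                 # sustained scraper request rate
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.59 million seconds

monthly_bytes = AVG_RESPONSE_BYTES * REQUESTS_PER_SEC * SECONDS_PER_MONTH
print(f"{monthly_bytes / 1e12:.2f} TB/month")  # prints ≈ 1.04 TB/month
```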
The author’s pragmatic defense was to “feed the bots” dynamically generated garbage: a lightweight Markov babbler that responds with cheap, nonsense content. It avoids disk I/O, uses ~60 CPU microseconds per request and ~1.2 MB RAM, and requires no blacklist maintenance—absorbing scraper pressure far more cheaply than serving static assets. The broader significance: web operators face an emerging externality from LLM data collection and may adopt low-cost baiting as a mitigation, but that in turn risks contaminating training corpora with synthetic noise and fuels an arms race between scrapers and site defenses.
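A minimal sketch of what such a babbler could look like, assuming a plain Python `http.server` endpoint, a tiny hard-coded seed corpus, and a first-order word chain; the original article does not specify its corpus, chain order, or server stack, so every name and number below is illustrative rather than a description of the author's implementation:

```python
# Sketch of the "feed the bots garbage" idea: an HTTP endpoint that answers
# every request with cheap, dynamically generated Markov nonsense and fresh
# links, so a link-following scraper never runs out of URLs to fetch.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Arbitrary seed text; any corpus works, since the output only needs to look
# like prose to a scraper, not to a human.
SEED_TEXT = (
    "the crawler follows every link it finds and the server answers with "
    "more links and more words so the crawler keeps asking for pages that "
    "never end because each page is generated the moment it is requested"
).split()

# First-order Markov chain: word -> list of words observed to follow it.
CHAIN = {}
for prev, nxt in zip(SEED_TEXT, SEED_TEXT[1:]):
    CHAIN.setdefault(prev, []).append(nxt)

def babble(n_words=200):
    """Generate n_words of nonsense by walking the chain from a random start."""
    word = random.choice(SEED_TEXT)
    out = [word]
    for _ in range(n_words - 1):
        word = random.choice(CHAIN.get(word, SEED_TEXT))
        out.append(word)
    return " ".join(out)

class BabbleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path returns a fresh page of garbage plus a few more random
        # links, keeping the trap "infinite" without touching the disk.
        body = "<html><body><p>{}</p>{}</body></html>".format(
            babble(),
            "".join(f'<a href="/{random.getrandbits(32):x}">more</a> '
                    for _ in range(5)),
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the trap quiet; no per-request logging

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BabbleHandler).serve_forever()
```

Because every page and link is fabricated at request time, there is no disk I/O and no blocklist to maintain; the per-request cost is just the chain walk and a string join, which is what makes this approach cheaper than serving real static assets.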