🤖 AI Summary
A developer fighting scraper-driven load built a set of “babbler” honeypots that feed crawlers garbage content to waste their time and bandwidth. The first version was a Markov-chain text generator (written while learning Rust), trained on a few hundred PHP files so it could emit realistic-looking but fake .php responses; the author experimented with output sizes from a few KB up to several MB to maximize the resources each bot burned. When that proved too costly to run on a VPS, they switched to an ultra-light static approach: load Mary Shelley’s Frankenstein into RAM as a list of paragraph nodes, serve four paragraphs starting at a random index on each request, and embed five in-page links to further random indices so a breadth-first crawler’s frontier explodes and greedy scrapers saturate themselves quickly.
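A minimal sketch of that static approach, under stated assumptions: the struct name `Babbler`, the file path, and the `/babble/{index}` route are invented for illustration, and the random start/link indices are supplied by the caller rather than drawn from a PRNG here. It only shows the idea of a memory-resident paragraph store that emits four paragraphs plus five trap links per page.

```rust
use std::fs;

// Hypothetical paragraph store: the whole novel is split into paragraphs once at startup
// and kept in RAM so responses never touch the disk.
struct Babbler {
    paragraphs: Vec<String>,
}

impl Babbler {
    // Load the source text and split it on blank lines into paragraph nodes.
    fn new(path: &str) -> std::io::Result<Self> {
        let text = fs::read_to_string(path)?;
        let paragraphs: Vec<String> = text
            .split("\n\n")
            .map(|p| p.trim().to_string())
            .filter(|p| !p.is_empty())
            .collect();
        Ok(Self { paragraphs })
    }

    // Build one trap page: four consecutive paragraphs starting at `start`,
    // followed by five links to further indices so a breadth-first crawler
    // keeps fanning out instead of terminating.
    fn page(&self, start: usize, link_targets: &[usize; 5]) -> String {
        let n = self.paragraphs.len();
        let body: String = (0..4)
            .map(|i| format!("<p>{}</p>\n", self.paragraphs[(start + i) % n]))
            .collect();
        let links: String = link_targets
            .iter()
            .map(|t| format!("<a href=\"/babble/{}\">more</a>\n", *t % n))
            .collect();
        format!(
            "<html><head><meta name=\"robots\" content=\"noindex, nofollow\"></head><body>{}{}</body></html>",
            body, links
        )
    }
}

fn main() -> std::io::Result<()> {
    // "frankenstein.txt" is a placeholder path to a local copy of the novel.
    let babbler = Babbler::new("frankenstein.txt")?;
    // Fixed indices here for demonstration; a real handler would draw them per request.
    println!("{}", babbler.page(42, &[7, 199, 350, 512, 801]));
    Ok(())
}
```

The only per-request work is a handful of index lookups and string formatting, which is what makes this viable on a small VPS where the Markov generator was not.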
Technically this is an active-deception defense that exploits crawler behavior: breadth-first link expansion and indiscriminate fetching of .php paths. Key implementation notes: keep the content memory-resident for very low-latency responses, track hits with in-memory counters per deployment, mark trap pages noindex/nofollow so well-behaved crawlers stay out, and expose the fake PHP on a separate static endpoint (search engines typically ignore non-HTML anyway). Caveats: it is an arms race, and if attackers learn to scrape more efficiently you lose; search engines might penalize a site that appears to be spamming, so don't run this on SEO-dependent projects; and consider fronting the trap with Cloudflare caching to preserve your outbound-transfer budget. For most sites, simple 403s remain the safest option.
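For completeness, a sketch of that conservative alternative, assuming a plain user-agent check: refuse suspected scrapers with a 403 instead of serving babble. The blocked substrings below are illustrative examples of common AI crawler user agents, not a vetted blocklist, and the test strings are made up.

```rust
// Return the HTTP status a hypothetical handler would send for a given User-Agent.
fn status_for(user_agent: &str) -> u16 {
    // Illustrative substrings only; maintain a real blocklist separately.
    const BLOCKED: [&str; 3] = ["GPTBot", "CCBot", "Bytespider"];
    if BLOCKED.iter().any(|b| user_agent.contains(*b)) {
        403
    } else {
        200
    }
}

fn main() {
    // Made-up user-agent strings to show the two outcomes.
    assert_eq!(status_for("ExampleAgent (compatible; GPTBot)"), 403);
    assert_eq!(status_for("Mozilla/5.0 (X11; Linux x86_64)"), 200);
    println!("403 checks passed");
}
```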