🤖 AI Summary
Independent blogger Chris Siebenmann reports that he is actively blocking a surge of high-volume web crawlers, many apparently harvesting data for LLM training, that masquerade as old Chrome browsers. His site's anti-crawler measures treat outdated User-Agent strings as suspicious and deny them access; legitimate readers on modern browsers who find themselves blocked are asked to contact him with their exact User-Agent string. He notes that some archival crawlers (archive.today, which also operates as archive.ph and archive.is) behave indistinguishably from these actors: they use old Chrome User-Agents, crawl from widely distributed IP blocks, and even present falsified reverse DNS records claiming to be Googlebot.
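The User-Agent blocking described above is easy to make concrete. Here is a minimal sketch of that style of check, not the site's actual rules (which the post does not publish): the Chrome major version is parsed out of the UA string, and anything below an assumed cutoff (`MIN_CHROME_MAJOR`, a hypothetical value) is flagged as suspicious.

```python
import re

# Hypothetical threshold: treat Chrome releases older than this major
# version as likely crawler camouflage. The post does not say what
# cutoff, if any, the site actually uses.
MIN_CHROME_MAJOR = 120

CHROME_RE = re.compile(r"Chrome/(\d+)\.")

def is_suspicious_ua(user_agent: str) -> bool:
    """Flag User-Agent strings that claim an outdated Chrome version."""
    match = CHROME_RE.search(user_agent)
    if match is None:
        return False  # not claiming to be Chrome; other rules would apply
    return int(match.group(1)) < MIN_CHROME_MAJOR

# Example: an old Chrome 88 UA of the sort the post describes.
old_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
          "AppleWebKit/537.36 (KHTML, like Gecko) "
          "Chrome/88.0.4324.150 Safari/537.36")
print(is_suspicious_ua(old_ua))  # True
```

The obvious limitation, and the post's point, is that a crawler can defeat this check simply by sending a current Chrome UA string.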
For the AI/ML community this is a concrete example of the tension between large-scale data collection and site operators' bandwidth and privacy defenses. Technically, it shows the limits of simple defenses (User-Agent blocking, IP heuristics, rDNS checks) when crawlers deliberately mimic real browsers or spoof identifiers, and it underscores that legitimate archival services differ in behavior; archive.org is cited as "better behaved." The post highlights practical implications for dataset builders (risk of being blocked, contaminated archives, and ethical concerns) and suggests that more transparent crawler behavior and better coordination with site owners would reduce friction.
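The falsified reverse DNS records mentioned above are exactly what forward-confirmed reverse DNS, the verification method Google itself documents for Googlebot, is designed to catch: anyone can put `googlebot.com` in a PTR record, but a spoofed record will not survive the forward lookup. A sketch in Python, with a hypothetical helper name:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP.

    A crawler controls its own PTR record, so the reverse lookup alone
    proves nothing; the hostname it returns must also resolve forward
    to the same IP address.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except socket.herror:
        return False
    # Google documents that genuine Googlebot hosts live under these domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips  # a fake PTR record fails this confirmation

# Usage (the IP here is illustrative; a real check runs against the
# connecting client's address):
# print(is_verified_googlebot("66.249.66.1"))
```

This catches the fake-Googlebot rDNS the post describes, though it does nothing against the broader problem of crawlers that simply present themselves as ordinary browsers from distributed IP blocks.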