🤖 AI Summary
A widely shared blog post from February 2025 warns that many site operators now block traffic presenting old Chrome User-Agent strings, because a recent surge of high-volume crawlers, partly attributed to data collection for LLM training, has been using those legacy UAs. Some archival services and scrapers also crawl from widely distributed IP ranges and even present falsified reverse-DNS entries claiming to be Googlebot, making simple UA checks and bare rDNS lookups unreliable. The author reports intentionally blocking such traffic to protect site resources and recommends using better-behaved archives (e.g., archive.org) or contacting the site owner if access is blocked.
For the AI/ML community, this is a reminder that User-Agent spoofing and rDNS fakery undermine the implicit trust models of web crawling. The consequences include broader blocking of scrapers (and with it the loss of legitimate data), thornier provenance and consent issues for training corpora, and a greater need for robust crawler identification and ethical practice. Dataset builders should prefer transparent, rate-limited crawlers with clear contact information, validate IP ownership (e.g., whois/PTR checks combined with ISP verification), respect robots.txt and site-owner requests, and consider reputable archives or explicit data-licensing APIs instead of stealthy scraping behind forged UAs.