Blocking LLM crawlers, without JavaScript (www.owl.is)

🤖 AI Summary
Researcher Uggla published a practical, server-side trick to deter sloppy LLM/web scrapers without relying on JavaScript or client-side proof-of-work. The idea: declare a "poisoned" path (e.g., /heck-off/) as disallowed in robots.txt, serve a lightweight interim HTML page to cookie-less requests that contains a deliberately misplaced meta-refresh and a hidden nofollow link to that poisoned path, and use cookies to record how the client behaved. Requests to /heck-off/ receive Set-Cookie: slop=1; requests to /validate/ receive Set-Cookie: validated=1 and are redirected back. On normal pages, clients bearing the slop cookie are blocked and clients with the validated cookie are allowed through. Cache-Control must be set to no-cache, no-store, must-revalidate so cached interim pages don't cause redirect loops.

Why it matters: this is a low-cost, no-JS approach that exploits common crawler weaknesses, namely ignoring robots.txt and sloppy HTML parsing, so it catches many of the automated scrapers used to harvest training data for LLMs while preserving access for well-behaved search engines. It is computationally cheap, produces no false positives by design, and is easy to deploy if you control the response headers.

Limitations are clear: sophisticated or updated crawlers can adapt, and it doesn't stop crawlers that obey robots.txt or correctly handle cookies and redirects.
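To make the flow concrete, here is a minimal sketch of the cookie-gating scheme as the summary describes it, written with Go's standard net/http package. The paths (/heck-off/, /validate/), cookie names (slop, validated), and the Cache-Control header come from the summary; the exact interim HTML, the in-body placement of the meta-refresh, the redirect target, and the port are illustrative assumptions, not the author's actual implementation.

```go
// Sketch of the no-JS crawler trap described above (assumptions noted inline).
package main

import (
	"fmt"
	"net/http"
)

// Interim page served to cookie-less clients: a meta refresh to /validate/
// deliberately placed in the <body> (assumed interpretation of "misplaced",
// so naive parsers may miss it) plus a hidden nofollow link to the poisoned
// path declared in robots.txt.
const interim = `<!DOCTYPE html>
<html><head><title>One moment...</title></head>
<body>
<meta http-equiv="refresh" content="0; url=/validate/">
<a href="/heck-off/" rel="nofollow" style="display:none">do not follow</a>
</body></html>`

func hasCookie(r *http.Request, name string) bool {
	_, err := r.Cookie(name)
	return err == nil
}

func main() {
	// robots.txt declares the poisoned path so well-behaved crawlers avoid it.
	http.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "User-agent: *\nDisallow: /heck-off/\n")
	})

	// Anything requesting the poisoned path ignored robots.txt: mark it.
	http.HandleFunc("/heck-off/", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "slop", Value: "1", Path: "/"})
		http.Error(w, "Forbidden", http.StatusForbidden)
	})

	// Clients that follow the meta refresh get validated and sent back
	// (redirect target simplified to "/" here).
	http.HandleFunc("/validate/", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "validated", Value: "1", Path: "/"})
		http.Redirect(w, r, "/", http.StatusFound)
	})

	// Normal pages: block marked crawlers, serve validated clients,
	// hand everyone else the interim page.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		switch {
		case hasCookie(r, "slop"):
			http.Error(w, "Forbidden", http.StatusForbidden)
		case hasCookie(r, "validated"):
			fmt.Fprint(w, "<!DOCTYPE html><html><body>Actual content.</body></html>")
		default:
			// Prevent caches from replaying the interim page, which the
			// summary notes would otherwise cause redirect loops.
			w.Header().Set("Cache-Control", "no-cache, no-store, must-revalidate")
			fmt.Fprint(w, interim)
		}
	})

	http.ListenAndServe(":8080", nil)
}
```

The same gating logic can live in a reverse proxy or CDN rule set instead of an application server; all it needs is the ability to inspect cookies and set response headers.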