Mitigating Aggressive Crawler Traffic in the Age of Generative AI (journal.code4lib.org)

🤖 AI Summary
Since spring 2024, the University of North Carolina at Chapel Hill Libraries, like many memory institutions worldwide, has been hit by successive waves of automated web crawlers tied to generative AI. Two distinct waves emerged: an early one from commercial cloud IPs targeting full text and media, and a later, more evasive one originating from residential ISPs that mimicked common browser user agents and distributed requests across large proxy pools. The traffic looked and behaved like a distributed denial-of-service attack, overwhelming servers and causing multi-day outages for catalogs and digital collections. An informal survey and parallel industry polls found that roughly two-thirds to three-quarters of libraries and repositories are actively battling similar bot traffic, underscoring a systemic problem for the cultural heritage sector.

UNC's technical response evolved from standard IP and user-agent blocking to layered defenses: fail2ban with progressive ban windows (1 day → 1 year), cloud load balancers and WAFs, request throttling, and a novel facet-based detection technique (both sketched below). Analysis revealed bots issuing highly unusual, resource-intensive facet combinations, e.g. thousands of identical "Finnish + Music" facet searches in a single day and large volumes of queries carrying 15 or more facets, which enabled targeted blocking of abusive query patterns rather than broad user-agent or IP blocks. The case highlights the need for coordinated sharing of indicators, community-run blocklists and heuristics, and joint vendor/hosting strategies to protect library infrastructure from adaptive crawlers supporting generative AI.
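The progressive ban windows map naturally onto fail2ban's incremental banning (the `bantime.increment` option in fail2ban 0.11+). As a language-agnostic illustration, here is a minimal Python sketch of the same escalation idea; the multiplier, the in-memory storage, and the function names are illustrative assumptions, not UNC's actual configuration.

```python
"""Sketch of progressive ban windows in the spirit of the article's
fail2ban setup (1 day -> 1 year). Requires Python 3.10+ for the
`float | None` hints. All constants are assumptions."""

import time

DAY = 86_400
BASE_BAN = 1 * DAY       # first offense: banned for one day
MAX_BAN = 365 * DAY      # repeat offenders top out at one year
FACTOR = 4               # hypothetical multiplier per repeat offense

offenses: dict[str, int] = {}        # ip -> number of prior bans
banned_until: dict[str, float] = {}  # ip -> Unix time the ban expires

def ban(ip: str, now: float | None = None) -> float:
    """Record an offense and return when the resulting ban expires.

    Each repeat offense multiplies the ban window by FACTOR, so the
    schedule runs 1 day, 4 days, 16 days, ... capped at one year.
    """
    now = time.time() if now is None else now
    prior = offenses.get(ip, 0)
    duration = min(BASE_BAN * FACTOR ** prior, MAX_BAN)
    offenses[ip] = prior + 1
    banned_until[ip] = now + duration
    return banned_until[ip]

def is_banned(ip: str, now: float | None = None) -> bool:
    """True if the IP is still inside its most recent ban window."""
    now = time.time() if now is None else now
    return banned_until.get(ip, 0.0) > now
```

In a real deployment the offense counts would live in fail2ban's own database and enforcement would happen at the firewall, but the escalation logic is the same.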
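The facet-based detection works by treating the set of facet parameters on each request as a signature and flagging signatures no human searcher would produce, either a single request stacking an implausible number of facets or one exact combination repeated at inhuman volume. Below is a minimal sketch assuming a Blacklight-style catalog where facets appear as `f[field][]=value` query parameters and a combined-format access log; the thresholds, regex, and log layout are assumptions, not values from the article.

```python
#!/usr/bin/env python3
"""Sketch of facet-based bot detection over a web server access log.
Assumes Blacklight-style f[field][]=value facet parameters; thresholds
are illustrative, not the article's actual values."""

import re
import sys
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Combined log format: pull out the client IP and the request path.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (?P<path>\S+)')

MAX_FACETS = 15          # a single request with 15+ facets is almost never human
REPEAT_THRESHOLD = 500   # same facet combo repeated this often in one log slice

def facet_signature(path: str) -> tuple:
    """Return an order-independent, hashable signature of the facet params."""
    query = urlsplit(path).query
    facets = sorted((k, v) for k, v in parse_qsl(query) if k.startswith("f["))
    return tuple(facets)

def scan(log_lines):
    combo_counts = Counter()
    flagged_ips = set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        sig = facet_signature(m.group("path"))
        if not sig:
            continue
        combo_counts[sig] += 1
        # Flag pathological single requests (e.g. 15+ facets at once).
        if len(sig) >= MAX_FACETS:
            flagged_ips.add(m.group("ip"))
    # Flag combos hammered at inhuman volume (e.g. thousands of identical
    # "Finnish + Music" searches in a day).
    abusive = {sig for sig, n in combo_counts.items() if n >= REPEAT_THRESHOLD}
    return flagged_ips, abusive

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        ips, combos = scan(f)
    for ip in sorted(ips):
        print("flag", ip)
    print(f"{len(combos)} abusive facet combinations")
```

The point of the technique is that blocking then targets the abusive query signatures themselves (e.g. via WAF rules), which survive even when the bots rotate through residential proxy IPs and spoofed user agents.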