🤖 AI Summary
Reddit has sued Perplexity and three data-scraping firms, alleging that they illegally obtained Reddit content. Reddit had laid a deliberate trap: a “marked” test post configured to be crawlable only via Google’s search results (Reddit has a content-licensing deal with Google). Within hours of posting, Perplexity’s answer engine was reproducing that unique content, which Reddit says could only have happened if Perplexity or its suppliers scraped Google SERPs or otherwise circumvented Reddit’s anti-scraping measures. The complaint names Oxylabs, SerpApi and AWM Proxy as potential intermediaries that may have harvested and sold the data. Perplexity denies training models on Reddit content and says it defends openness, while the scraping firms say they will contest the claims.
The case is significant because it highlights technical and legal fault lines in how LLM builders source web data. Search-result scraping, proxy services and botnets can defeat page-level restrictions (robots.txt or site guardrails), creating legal exposure for downstream AI providers even if they claim not to train on specific sites. Reddit’s “marked bill” tactic mirrors Cloudflare’s earlier tests and could become a model for proving data provenance in court. The dispute has practical implications for compliance, auditing and defensive engineering (stricter site controls, provenance tracking, contractual licensing), and it could shape liability and licensing norms for large-scale model training.
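The “marked bill” idea can be illustrated with a minimal Python sketch. This is not Reddit's actual method, which the summary does not describe in detail. The function names (`make_marker`, `embed_marker`, `find_markers`), the `canary` prefix and the `Ref:` line are all hypothetical and chosen only for illustration.

```python
import secrets


def make_marker(prefix: str = "canary") -> str:
    """Return a unique, unguessable token to plant in a test post (hypothetical format)."""
    return f"{prefix}-{secrets.token_hex(8)}"


def embed_marker(body: str, marker: str) -> str:
    """Append the marker to the post body so any verbatim copy carries it along."""
    return f"{body}\n\nRef: {marker}"


def find_markers(text: str, planted: dict[str, str]) -> list[str]:
    """Return the labels of every planted marker that appears in the given text."""
    lowered = text.lower()
    return [label for marker, label in planted.items() if marker.lower() in lowered]


# Plant one marker, then check a (simulated) third-party answer for it.
marker = make_marker()
planted = {marker: "test-post-1"}
post = embed_marker("A test post with otherwise ordinary content.", marker)
simulated_answer = f"According to a forum post: {post}"
hits = find_markers(simulated_answer, planted)
```

A match shows only that the marked text surfaced somewhere downstream. Establishing which route it took (a search-results scrape, a proxy service, or something else) depends on how access to the post was restricted, which this sketch does not model.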