Extracting books from production language models (2026) (arxiv.org)

🤖 AI Summary
Researchers have studied how much memorized text can be extracted from production large language models (LLMs), with a focus on data memorization and copyright. Using a two-phase extraction procedure, the team probed four leading models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3) to measure how readily each reproduces copyrighted content memorized from its training data.

The study found that substantial amounts of copyrighted material can still be retrieved from production models despite the safeguards typically deployed to prevent such leakage. In certain configurations, Claude 3.7 Sonnet reproduced book content nearly verbatim with a recall rate of 95.8%, while GPT-4.1 yielded far less, with a recall rate of only 4%.

These results underscore the ongoing legal and ethical challenges the AI/ML community faces around LLMs and their training data, and they raise questions about the safety and reliability of production models given their potential to infringe copyright. The findings point to a need for stronger safeguards and clearer rules on data usage in machine learning to protect intellectual property as these systems are developed.
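The summary does not spell out the two-phase procedure or how recall is scored, so the following is only an illustrative sketch: it assumes a prefix-prompting setup in which the model is fed a verbatim passage from a book and its continuation is scored for verbatim overlap with the true text. The names `query_model`, `extract_continuation`, `verbatim_recall`, and `score_book` are hypothetical and not taken from the paper.

```python
# Illustrative sketch of prefix-prompt extraction and verbatim-recall scoring.
# This is NOT the paper's code; query_model is a placeholder for a real
# production-model API client.

from difflib import SequenceMatcher


def query_model(prompt: str, max_tokens: int = 256) -> str:
    """Hypothetical wrapper around a production LLM API (placeholder)."""
    raise NotImplementedError("Replace with a real API client call.")


def extract_continuation(prefix: str) -> str:
    """Assumed phase: prompt the model with a verbatim book prefix and
    collect its continuation."""
    return query_model(prefix)


def verbatim_recall(reference: str, generated: str) -> float:
    """Fraction of the reference continuation that appears verbatim in the
    model output, measured via longest matching blocks."""
    matcher = SequenceMatcher(None, reference, generated, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(reference), 1)


def score_book(chunks: list[tuple[str, str]]) -> float:
    """Average recall over (prefix, true_continuation) pairs drawn from a book."""
    scores = [verbatim_recall(truth, extract_continuation(prefix))
              for prefix, truth in chunks]
    return sum(scores) / max(len(scores), 1)
```

Under this reading, a figure like 95.8% recall would mean that nearly all of the sampled reference text reappeared verbatim in the model's outputs, while 4% would mean almost none did.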