AI Summary
The Internet Archive, housed in a converted church in San Francisco, has spent nearly 30 years collecting and preserving roughly one trillion webpages through its Wayback Machine. CNN visited the Archive's headquarters to show how the organization is adapting to the "AI age": scaling web crawls and long-term storage, improving indexing and metadata, and hardening both digital and physical protections against political pressure, legal challenges, and disasters that could threaten irreplaceable web history.
For the AI/ML community the Archive is both a resource and a responsibility: its vast, time-stamped corpus is crucial for training language and multimodal models, auditing dataset provenance, and reproducing experiments, but it also raises technical and legal challenges around copyright, data quality, deduplication, and bias. Preserving one trillion pages requires huge bandwidth, petabytes of storage, efficient indexing and retrieval systems, and clear metadata and licensing signals so researchers can build accountable models. The Archive's efforts to make web history robust and accessible therefore directly affect model transparency, dataset curation, and the long-term reproducibility of AI research.
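As a concrete illustration of how that time-stamped corpus supports provenance auditing, here is a minimal Python sketch that queries the Wayback Machine's public CDX API (https://web.archive.org/cdx/search/cdx) for captures of a URL. The example URL, date range, and result limit are arbitrary choices for illustration; the `collapse=digest` parameter asks the API to deduplicate captures with identical content hashes, which is one practical answer to the deduplication problem mentioned above.

```python
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def list_snapshots(url, year_from="2015", year_to="2020", limit=5):
    """Return time-stamped Wayback Machine captures of `url` as dicts."""
    params = {
        "url": url,
        "from": year_from,
        "to": year_to,
        "output": "json",
        "limit": limit,
        "filter": "statuscode:200",  # keep only successfully archived captures
        "collapse": "digest",        # collapse duplicates with the same content hash
    }
    resp = requests.get(CDX_API, params=params, timeout=30)
    resp.raise_for_status()
    rows = resp.json()
    if not rows:
        return []
    header, captures = rows[0], rows[1:]  # first row is the field-name header
    return [dict(zip(header, row)) for row in captures]

if __name__ == "__main__":
    for snap in list_snapshots("example.com"):
        # Each capture carries a timestamp and a content digest, the raw
        # material for dataset provenance and reproducibility audits.
        print(snap["timestamp"], snap["digest"], snap["original"])
```

A researcher auditing a training set could use records like these to pin each document to the exact capture date and content hash it was drawn from, rather than to a live page that may have since changed.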