Enron Corpus (en.wikipedia.org)

🤖 AI Summary
The Enron Corpus is a publicly released archive of email data captured during the federal investigation into Enron's collapse: roughly 600,000 messages from 158 employees (originally >160 GB of raw data), collected from Enron's servers in May 2002. The Federal Energy Regulatory Commission preserved the emails and associated enterprise databases (including Oracle-hosted systems and the EnronOnline trading platform); a copy was later purchased and redistributed to researchers by Andrew McCallum. Researchers (notably Klimt & Yang, Shetty & Adibi) processed and re-released the data in accessible forms: Shetty & Adibi produced a MySQL dump in 2004, and EDRM's expanded v2 release (2010) contains ~1.7M messages and is hosted on Amazon S3.

For the AI/ML community the corpus is important because it is one of the rare large-scale, real-world email collections available without restrictive NDAs or heavy sanitization. It has been used extensively for email classification, social-network and link-analysis studies, and sociolinguistic research, and as training/test data in NLP benchmarks (it is included in The Pile). Technical implications include ready applications for graph-based node-importance algorithms, topic and temporal language-change analysis, and supervised or unsupervised models for email threading, spam/ham classification, and authorship tasks. Its provenance and public-domain status make it a valuable reproducible resource, but users must still weigh ethical and privacy concerns when applying modern ML techniques.
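As a rough illustration of the graph-based analyses mentioned above, the sketch below parses raw RFC 2822 messages (the format of the corpus's maildir-style distribution) and computes a weighted in-degree per address as a crude node-importance proxy. The sample messages are invented for illustration, not drawn from the corpus; this is a minimal stdlib-only sketch, not a reference pipeline.

```python
from email import message_from_string
from email.utils import getaddresses
from collections import Counter

# Tiny invented messages standing in for raw Enron maildir files.
RAW_MESSAGES = [
    "From: alice@enron.com\nTo: bob@enron.com, carol@enron.com\n"
    "Subject: Q3 numbers\n\nSee attached.",
    "From: bob@enron.com\nTo: alice@enron.com\n"
    "Subject: Re: Q3 numbers\n\nThanks.",
    "From: carol@enron.com\nTo: alice@enron.com\n"
    "Subject: Re: Q3 numbers\n\nGot it.",
]

def edge_counts(raw_messages):
    """Count sender -> recipient edges across raw RFC 2822 messages."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        _, sender = getaddresses(msg.get_all("From", []))[0]
        for _, rcpt in getaddresses(msg.get_all("To", []) + msg.get_all("Cc", [])):
            edges[(sender, rcpt)] += 1
    return edges

def in_degree(edges):
    """Weighted in-degree per address: how much mail each node receives."""
    score = Counter()
    for (_, rcpt), n in edges.items():
        score[rcpt] += n
    return score

if __name__ == "__main__":
    scores = in_degree(edge_counts(RAW_MESSAGES))
    print(scores.most_common())
```

On the real corpus the same edge list would feed a proper centrality measure (e.g. PageRank); in-degree is used here only to keep the sketch self-contained.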