Monday, July 22, 2013

Test Data

Quora has a super-handy post for data nerds listing big open source data sets you can use for testing:

I would add to that the Enron e-mail corpus, which is great for testing anything against e-mails:

There are "only" about 600,000 e-mails in total, and only about 300,000 of those are unique, but for an unstructured data set it's good, and it's the gold standard.
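
If you want to check the duplicate ratio on your own copy of an e-mail corpus, here is a minimal sketch in Python. It assumes that "unique" means same sender, same subject, and same body after collapsing whitespace. That definition is my assumption, not necessarily how the figures above were counted. The function names `message_fingerprint` and `count_unique` are made up for this example, and it only handles plain-text (non-multipart) messages.

```python
import hashlib
from email import message_from_string


def message_fingerprint(raw_message: str) -> str:
    """Return a SHA-256 digest of sender, subject, and whitespace-normalized body."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    # Multipart messages return a list of parts; this sketch ignores those bodies.
    body = payload if isinstance(payload, str) else ""
    # Collapse all runs of whitespace so trivial formatting differences don't matter.
    normalized_body = " ".join(body.split())
    # Assumed definition of "same message": same From, Subject, and normalized body.
    key = "\n".join([
        str(msg.get("From", "")),
        str(msg.get("Subject", "")),
        normalized_body,
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def count_unique(raw_messages):
    """Return (total, unique) counts for an iterable of raw message strings."""
    seen = set()
    total = 0
    for raw in raw_messages:
        total += 1
        seen.add(message_fingerprint(raw))
    return total, len(seen)
```

Headers like Message-ID are deliberately left out of the fingerprint, since copies of the same message stored in different folders may not agree on them. A stricter or looser definition of "duplicate" will give a different unique count.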

Another useful list of large unstructured data sets:

