Quora has a super-handy post for data nerds listing big open source data sets you can use for testing:
http://www.quora.com/Big-Data/What-kinds-of-large-datasets-open-to-the-public-do-you-analyze-the-mostly
I would add to that the Enron e-mail corpus, which is great for testing anything against e-mails:
http://www.cs.cmu.edu/~enron/
There are "only" about 600,000 emails in total, and only about 300,000 of those are unique, but as unstructured data sets go it's a good one, and it's the gold standard.
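If you want to check the total-versus-unique split yourself, here is a minimal sketch. It assumes that "unique" means "same body text after collapsing whitespace and case", which is only one possible definition, and the helper names body_fingerprint and count_unique are made up for this example.

```python
import hashlib
from email import message_from_string


def body_fingerprint(raw_message: str) -> str:
    """Hash the normalized body so copies of one email in different folders collide."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    if not isinstance(payload, str):
        # Multipart message: fall back to hashing the whole raw text.
        payload = raw_message
    # Assumption: whitespace and case differences do not make an email "different".
    normalized = " ".join(payload.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8", errors="replace")).hexdigest()


def count_unique(raw_messages):
    """Return (total, unique) for an iterable of raw message strings."""
    total = 0
    seen = set()
    for raw in raw_messages:
        total += 1
        seen.add(body_fingerprint(raw))
    return total, len(seen)
```

To run it on the real corpus, walk the extracted download with os.walk, read each file as text (errors="replace" helps with odd bytes), and pass the contents to count_unique. The directory layout depends on which copy of the corpus you grabbed, and the counts you get will depend on how strict your definition of a duplicate is.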
Update:
Another useful list of large unstructured data sets:
http://www.quora.com/Where-can-I-get-large-corpora-open-to-the-public