A Meta-index of Data Sets
I recently had to go hunting for some data on which to try out some new ideas. As handy as Google is, there’s still a fair bit of chaff from which to sort the wheat.
Fortunately, there is a lot of good stuff out there, including well-organised indexes of data sets for various purposes. For my future reference (and for anyone else who may be interested), here are some of the better data set lists I found.
- UCI Repositories: No list of lists would be complete without this perennial collection of machine learning data sets hosted by the University of California, Irvine. They also have a repository of large data sets for knowledge discovery in databases (KDD).
- The Info: This site “for people with large data sets” has a community-editable list of data sets organised by topic. The collection here has a web/text focus.
- Text Retrieval: This list kept by NIST has data sets for each of the various tracks at the Text Retrieval Conference, including data sets for spam detection, genomics, and a terabyte track (although the data sets aren’t quite up to a terabyte yet).
- Time Series Data Library: This collection has a large number of time varying data sets from finance, demography, physics, sport and ecology.
As well as the institution- or community-organised lists above, I also came across some maintained by individuals.
- Daniel Lemire: Daniel Lemire’s “Data for Database Research” is organised by application areas, including data for earthquakes, weather, finance, climate and blogs.
A few specific data sets caught my eye, some new, and some I just hadn’t seen before.
- Freebase Wikipedia Extraction: The WEX data set is essentially a large (57 GB) graph of articles from Wikipedia.
- Enron Email: This collection of email (400 MB compressed) between Enron staff contains about half a million messages organised into folders. It was released publicly as part of the investigation into Enron and has been used by William Cohen and others as part of the CALO project.
- Freeway Traffic Analysis: This fairly large data set records traffic flow on several lanes of the I–880 freeway in California. It was collected to study how effective roving tow trucks are at clearing traffic incidents and easing the congestion they cause.
If all else fails and you still cannot find a suitable data set for your research, you can always invoke the social web and trawl through bookmarks on services like del.icio.us. The global data set tag can occasionally throw up some interesting hits, but there might be a higher wheat-to-chaff ratio in particular users’ bookmarks, such as those of Peter Skomoroch. Mine are not nearly as comprehensive yet.
It would be interesting to do a meta-analysis of all these data sets to see how our ability as a discipline to deal with larger and more complex data sets has increased over time. As Daniel Lemire pointed out with some surprise recently, processing a terabyte of data isn’t that uncommon.
February 22, 2008