I recently had to go hunting for some data on which to try out some new ideas. As handy as Google is, there’s still a fair bit of chaff from which to sort the wheat.
Fortunately, there is a lot of good stuff out there, including well-organised indexes of data sets for various purposes. For my future reference (and for anyone else who may be interested), here are some of the better data set lists I found.
DMOZ Directory of Data Sets: This is a good starting point for more lists of data sets for machine learning.
Parts of DMOZ itself are available in RDF as a data set for researchers. There is also a processed version made available as part of the PASCAL Ontology Learning Challenge.
As well as the above institution- or community-organised lists, I also came across some maintained by individuals.
A few specific data sets caught my eye, some new, and some I just hadn’t seen before.
If all else fails and you still cannot find a suitable data set for your research, you can always invoke the social web and trawl through bookmarks on services like del.icio.us. The global data set tag can occasionally throw up some interesting hits, but there might be a higher wheat-to-chaff ratio in particular users’ bookmarks, such as Peter Skomoroch’s. Mine are not nearly as comprehensive yet.
It would be interesting to do a meta-analysis of all these data sets to see how our ability as a discipline to deal with larger and more complex data sets has increased over time. As Daniel Lemire pointed out with some surprise recently, processing a terabyte of data isn’t that uncommon.
Mark Reid February 22, 2008 Canberra, Australia