I have been thinking about learning and prediction as services for some time now. Like all good ideas, they tend to be thought of independently by several people when their time is ripe. Therefore I was not completely surprised when I heard the news yesterday that Google has released a new RESTful prediction API.
As a couple of other bloggers (John, Panos) have already noted, this is very exciting as it has the potential to make statistical inference a commodity and put machine learning tools in the hands of everyday developers.
The details are a little scant since the API is not yet open to the public but, as the FAQ and sample code explain, it appears to work as follows (a rough code sketch of these steps appears after the list):
1. A data set in CSV format is uploaded to Google storage. This can contain up to 100 million rows of text or numeric features. Each row can be associated with one of up to several hundred classes.
2. The URL obtained after uploading the data set is POSTed to a second URL, /prediction/v1/train/DATA_ID, for Google’s learning algorithm (all URLs are relative to https://googleapis.com). It is not clear what algorithms are being used behind the scenes for this step but the home page says the API will automatically choose from a variety of techniques.
3. The training occurs asynchronously and its progress can be queried by issuing a GET to /prediction/v1/query/DATA_ID. Once training is completed, this query will return a cross-validated estimate of the learned model’s accuracy.
4. To make a new prediction with the trained model, a POST request containing the data to classify is sent to the /prediction/v1/query/DATA_ID URL and a label prediction is returned.
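To make those steps concrete, here is a minimal sketch of how a client might drive the API from Python’s standard library. Only the endpoint paths and base URL come from the documentation quoted above; the authentication header, the DATA_ID value, and the exact request and response payloads are placeholders of my own, since the API is not yet publicly available.

```python
import json
import urllib.request

BASE = "https://googleapis.com"        # base URL mentioned in the docs
DATA_ID = "mybucket%2Fmydata.csv"      # hypothetical ID of the CSV already uploaded to Google storage
AUTH = {"Authorization": "GoogleLogin auth=PLACEHOLDER"}  # real auth scheme not yet documented

def call(method, path, body=None):
    """Send a request to the Prediction API and return the parsed JSON response (format assumed)."""
    data = json.dumps(body).encode() if body is not None else None
    headers = dict(AUTH)
    headers["Content-Type"] = "application/json"
    req = urllib.request.Request(BASE + path, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

# 1. Kick off training on the previously uploaded CSV (asynchronous).
call("POST", f"/prediction/v1/train/{DATA_ID}")

# 2. Poll for training progress; the response fields here are guesses,
#    but the docs say a cross-validated accuracy estimate is returned once done.
status = call("GET", f"/prediction/v1/query/{DATA_ID}")
print(status)

# 3. Classify a new example; the request body format is also a guess.
prediction = call("POST", f"/prediction/v1/query/{DATA_ID}",
                  body={"input": {"text": "an example document to classify"}})
print(prediction)
```

In practice a client would also need whatever authentication mechanism Google settles on, plus the initial upload of the CSV to Google storage, neither of which is spelled out in the FAQ yet.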
Although this is a relatively simple API and, at present, only deals with classification, I believe it has the potential to cover a large proportion of most web developers’ prediction needs (e.g., text classification, sentiment analysis, click-through analysis) as well as several scientific applications.
The Google Prediction API is not the first to offer inferential services over the web but I do think Google is the first to focus on building reusable predictors and to do so with a clean API design.
Some other projects offering prediction services include:
uClassify — This is probably the closest existing service to Google’s. It also provides an API for training and predicting but, upon a cursory examination, appears a bit more complicated than the Google prediction API. I believe the main algorithm used by uClassify is a variant of naïve Bayes.
MLcomp — This recently announced service is more aimed at machine learning researchers and provides a convenient way to compare several algorithms on a selection of data sets using a variety of metrics. Unlike Google’s offering, MLcomp does not make the trained predictors available via an API and focuses more on providing easily repeatable experiments. One nice thing about the MLcomp service is that anyone is free to upload learning algorithms provided they implement a simple calling pattern.
predict — A simpler MLcomp-like service built by Joshua Reich that lets users upload CSV files to learn from and/or snippets of R code to run. Once again, the aim is to evaluate algorithms rather than train predictors for subsequent use.
TunedIT — Similar to MLcomp and i2pi’s predict, this service aims to make comparing learning algorithms across data sets easier. As far as I can tell, it does not offer an API for running learners and predictors over the web but rather lets users create data-mining challenges that other users can compete in.
ExpDB — This is not so much a service as a growing database of experimental results but I thought I’d include it here as it has a similar focus to the last three projects. The main innovation here is the creation of a language — ExpML — for describing and querying the parameters, algorithms, data sets and results of machine learning experiments.
RL-Glue — While not about prediction per se, this somewhat older project is related as it offers an API for defining reinforcement learning problems that can be solved in a programming language-independent way.
Of course, there are also many machine learning toolkits such as Weka, Orange, Elefant, Rattle and more that provide implementations of algorithms, but these do not offer them as services.
Over the last few years we’ve seen a dramatic increase in the amount of data being generated and made available over the web (e.g., Freebase, DBpedia, Data.gov, Netflix, and protein databases). Thanks to services from Google, Amazon, and others, there has also been a large-scale commodification of computational power and storage.
There are a handful of companies at present — Flightcaster, for example — who have realised that there is immense opportunity at the intersection of these developments to start applying large-scale machine learning. Hopefully, what the Google Prediction API and other services will provide is the spark for an explosion of new and creative approaches to distilling knowledge from raw information.
I will be watching how this all unfolds with great interest.
Mark Reid
May 21, 2010
Canberra, Australia