Wired ran a controversial article last year on the “End of Theory” in which Chris Anderson argued that simply throwing statistical algorithms at the sheer quantity of data now available renders models obsolete. In doing so, he misquotes Google’s research director Peter Norvig as saying “All models are wrong, and increasingly you can succeed without them.”

In his correction and lament, Norvig recounts an old AI koan that highlights the subtle but important difference between not knowing what the right model is and not assuming one at all:

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP–6.

“What are you doing?”, asked Minsky.

“I am training a randomly wired neural net to play Tic-Tac-Toe,” Sussman replied.

“Why is the net wired randomly?”, asked Minsky.

“I do not want it to have any preconceptions of how to play”, Sussman said.

Minsky shut his eyes.

“Why do you close your eyes?”, Sussman asked his teacher.

“So that the room will be empty.”

At that moment, Sussman was enlightened.

The point is that *all* learning algorithms have a bias, even if you don’t understand exactly what it is.^{1}

This is an important point. A student who first applies the latest Deep-Bayes-Vector-Network-Boost algorithm to a data set given in class can be seduced by its seemingly amazing performance and assume that (given enough data) it will always do wonderfully on any problem.

However, there is no free lunch. Each algorithm brings its own bias to bear on a problem. In some cases it will be appropriate for the task at hand while in others it can be misleading.
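To make this concrete, here is a toy sketch, assuming NumPy (the data, degrees, and seed are made up for illustration): two learners with different biases, a straight line and a degree-9 polynomial, fit the same ten noisy, roughly linear points.

```python
# A toy sketch of two learners with different biases, assuming NumPy.
# The data, polynomial degrees, and seed are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + 0.05 * rng.normal(size=10)  # roughly linear data

# A learner biased towards straight lines ...
linear = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
# ... and one flexible enough to memorise all ten points exactly.
wiggly = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

# The degree-9 polynomial interpolates the training data almost perfectly,
# while the line is left with the noise as residual error.
err_linear = np.max(np.abs(linear(x_train) - y_train))
err_wiggly = np.max(np.abs(wiggly(x_train) - y_train))
assert err_wiggly < err_linear
```

Which bias is appropriate depends entirely on the task: if the underlying relationship really is linear, the line’s residual is noise it was right to ignore, while the polynomial’s perfect fit is memorised noise that can mislead it away from the training points.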

Moreover, there is no getting around this. Tom Mitchell, in his now-classic textbook *Machine Learning*, states that a bias is a necessary condition for learning:

A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.

This is essentially an update of David Hume’s problem of induction, and an algorithm’s bias could be seen as the *in silico* version of his “general habitual principle” of mind.

In a clever twist, Mitchell goes on to *define* bias to be any minimal set of assumptions that makes the behaviour of a learning algorithm on a set of training examples purely deductive. That is, whatever else is required to turn an ill-posed generalisation from examples into a turn-the-crank computation such as a search *is* the bias of the learning algorithm.^{2}

This view of biases for machine learning algorithms has had a significant impact on the discipline. These days, biases are often formalised as regularisation terms. These are penalties against overly complex models which prevent overfitting of examples that would otherwise occur if a loss was naïvely minimised. Results such as the representer theorem then guarantee that a solution to the learning problem under this bias can be found by solving a deterministic optimisation problem derived from the training examples.
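As a minimal sketch of this view, consider ridge regression, assuming NumPy (all names and values here are illustrative): the regularisation term penalising the squared norm of the weights is an explicit, tunable bias towards simple models, and under it the learning problem reduces to a deterministic linear solve over the training examples.

```python
# A minimal sketch of regularisation as an explicit bias, assuming NumPy.
# Ridge regression penalises large weights; the biased learning problem
# reduces to the deterministic linear system (X^T X + lam * I) w = X^T y.
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise ||X w - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=20)

w_plain = ridge_fit(X, y, lam=0.0)    # unregularised least squares
w_ridge = ridge_fit(X, y, lam=10.0)   # strong bias towards small weights

# The bias shrinks the solution towards zero.
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_plain)
```

The penalty weight `lam` makes the strength of the bias explicit: at zero we recover plain least squares, and as it grows the learner increasingly prefers small weights over fitting every training point.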

Even though it is certain that we will have more and more data to throw at our problems, and that our inference techniques are getting more and more sophisticated, it is not the case that, as Chris Anderson put it,

Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

Without a bias, the only inference you can make from a mountain of data is that you have a mountain of data.

It is always prudent to note that this type of inductive bias is different to a statistical bias.↩

Of course, there are subtleties surrounding non-deterministic algorithms. However, these usually require a seed for their random number generator which can be viewed as part of the algorithm’s bias.↩