The Compression Lemma

I recently re-read an ICML 2006 paper called On Bayesian Bounds by Arindam Banerjee that has at its core a very elegant and general result. I spent a bit of time trying to carefully understand it and thought I would write up my take on it. Here’s a slight restatement of the key result.

Compression Lemma (Banerjee, 2006)
For any function equation and any pair of distributions equation and equation over equation we have

equation
\displaystyle\mathbb{E}_Q\left[\phi(x)\right] - \log \mathbb{E}_P\left[e^{\phi(x)}\right] \le KL(Q\|P)

where equation is the Kulback-Leibler divergence from equation to equation. Furthermore, the bound is achieved for equation.

Notice that there are no restrictions placed on the choice of the function (aside from implicit measurability) or the distributions. The fact that this inequality holds so broadly is the reason Banerjee is able to use it to derive a PAC-Bayesian bound and other related bounds in online learning. I won’t go into these here but it is worth looking at the paper to see how easily these are derived through a judicious choice of equation.

The reason for the name compression lemma is not apparent at first. Indeed, apart from the presence of the KL divergence, there is little to immediately connect it with information theory. Fortunately, the paper does give a way of looking at it that makes this connection clear.

What follows is my own simplification of the proof given by Banerjee. It uses all the same ideas but simplifies his argument a little.

A Simple Proof

We need to introduce a slight variant of the KL divergence which, for wont of a better name, I’ll call the relative cross entropy of distributions equation and equation relative to equation:

equation
\displaystyle C(Q,R\|P) = \mathbb{E}_Q\left[\log \frac{dR}{dP} \right].

The negative relative cross entropy can be interpreted as the expected code length required to encode symbols drawn from equation when using a code based on equation.

As you can see, the relative cross entropy is equal to the KL divergence when equation. In fact, we can easily see that

equation
\displaystyle C(Q,R\|P) = KL(Q\|P) - KL(Q\|R)

by noting that equation and splitting and matching the resulting expectation with the definition of KL divergence.

Since the KL divergence is guaranteed to be non-negative, we see immediately that equation with equality holding when equation, that is, when equation.

Now we consider a specific choice of equation, namely a Gibbs distribution based on the function equation. Its density relative to equation is given by

equation
\displaystyle \frac{dR}{dP}(x) = \frac{1}{Z_{\phi}} e^{\phi(x)}

where equation is just a normalising term.

Plugging this definition of equation into equation gives

equation
\displaystyle \mathbb{E}_Q\left[\log\left(Z_{\phi}^{-1}e^{\phi(x)}\right)\right] = \mathbb{E}_Q\left[\phi(x)\right] - \log \mathbb{E}_P\left[ e^{\phi(x)} \right]

and since equation we have proved the first part of the compression lemma. The second part follows by substituting equation into the definition of the Gibbs distribution equation and noting that, in this case, equation.

Some Observations

Banerjee makes a number of insightful observations about the compression lemma, including its connection with the class of tight bounds achievable via Fenchel duality.

One small thing I think my proof adds is a simple quantification of the gap in the compression lemma inequality. Specifically, it is equation—the divergence of equation, the distribution generating the data, from equation, the distribution modelling equation. While this doesn’t have a simple form when equation is a Gibbs distribution for equation (at least I can’t see it) it has straight-forward interpretation.

Mark Reid 16 August 2010 Canberra, Australia