PAC-Bayes Mini-tutorial: A Continuous Union Bound

When I first encountered PAC-Bayesian concentration inequalities they seemed to me to be rather disconnected from good old-fashioned results like Hoeffding’s and Bernstein’s inequalities. But, at least for one flavour of the PAC-Bayesian bounds, there is actually a very close relation, and the main innovation is a continuous version of the union bound, along with some ingenious applications. Here’s the gist of what’s going on, presented from a machine learning perspective.

NB On the website I can only provide a high-level sneak preview of the tutorial for brevity. For the full version, which is still quite short, but includes the proofs and fills in the missing details, see PAC-BayesMiniTutorial.pdf.

The Cramér-Chernoff Method

I will start by outlining the Cramér-Chernoff method, from which Hoeffding’s and Bernstein’s inequalities and many others follow. This method is incredibly well explained in Appendix A of the textbook by Cesa-Bianchi and Lugosi [3], but I will have to change the presentation a little to easily connect with the PAC-Bayesian bounds later on.

Let $D =((X_1,Y_1),\ldots,(X_n,Y_n))$ be independent, identically distributed (i.i.d.) examples, and let $h$ be a hypothesis from a set of hypotheses $\mathcal{H}$, which gets loss $\ell(X_i,Y_i,h)$ on the $i$-th example. For example, we might think of the squared loss $\ell(X_i,Y_i,h) = (Y_i – h(X_i))^2$. We also define the empirical error1 of $h$

R_n(D,h) = \frac{1}{n} \sum_{i=1}^n \ell(X_i,Y_i,h),

and our goal is to prove that the empirical error is close to the generalisation error

R(h) = \E[\ell(X,Y,h)]

with high probability. To do this, we define the function

= -\tfrac{1}{\eta} \ln \E\Big[e^{-\eta \ell(X,Y,h)}\Big]
\qquad \text{for $\eta > 0$,}

which will act as a surrogate for $R(h)$. Now the Cramér-Chernoff method tells us that:

Lemma 1 For any $\eta > 0$, $\delta \in (0,1]$,
M_\eta(h) \leq R_n(h,D) + \frac{1}{\eta n}\ln \frac{1}{\delta}
\end{equation}with probability at least $1-\delta$.

Many standard concentration inequalities can then be derived by first relating $M_\eta(h)$ to $R(h)$, and then optimizing $\eta$. This includes, for example, Hoeffding’s inequality and inequalities involving the variance like Bernstein’s inequality. See the full version of this post for details.

The Union Bound

Now suppose we use an estimator $\hat{h} \equiv \hat{h}(D) \in \mathcal{H}$ to pick a hypothesis based on the data, for example using empirical risk minimization: $\hat{h} = \argmin_{h \in \mathcal{H}} R_n(D,h)$. To get a bound for $\hat{h}$ instead of a fixed $h$, we want \eqref{eqn:chernoff} to hold for all $h \in \mathcal{H}$ simultaneously. If $\mathcal{H}$ is countable, this can be done using the union bound, which leads to:

Lemma 4 Suppose $\mathcal{H}$ is countable. For $h \in \mathcal{H}$, let $\pi(h)$ be any numbers such that $\pi(h) \geq 0$ and $\sum_h \pi(h)= 1$. Then, for any $\eta > 0$, $\delta \in (0,1]$, \begin{equation}
M_\eta(\hat{h}) \leq R_n(D,\hat{h}) + \frac{1}{\eta n}\ln
\end{equation} with probability at least $1-\delta$.

In this context, the function $\pi$ is often referred to as a prior distribution, even though it need not have anything to do with prior beliefs.

Just like for Lemma 1, we can then again relate $M_\eta(h)$ to $R(h)$ to obtain a bound on the generalisation error. This shows, in a nutshell, how one can combine the Cramér-Chernoff method with the union bound to obtain concentration inequalities for estimators $\hat{h}$. The use of the union bound, however, is quite crude when there are multiple hypotheses in $\mathcal{H}$ with very similar losses, and the current proof breaks down completely if we want to extend it to continuous classes $\mathcal{H}$. This is where PAC-Bayesian bounds come to the rescue: in the next section I will explain the PAC-Bayesian generalisation of Lemma 4 to continuous hypothesis classes $\mathcal{H}$, which will require replacing $\hat{h}$ by a randomized estimator.

PAC-Bayesian Concentration

Let $\hat{\pi} \equiv \hat{\pi}(D)$ be a distribution on $\mathcal{H}$ that depends on the data $D$, which we will interpret as a randomized estimator: instead of choosing $\hat{h}$ deterministically, we will sample $h \sim \hat{\pi}$ randomly. The distribution $\hat{\pi}$ is often called the PAC-Bayesian posterior distribution. Now the result that the PAC-Bayesians have, may be expressed as follows:

Lemma 6 Let $\pi$ be a (prior) distribution on $\mathcal{H}$ that does not depend on $D$, and let $\hat{\pi}$ be a randomized estimator that is allowed to depend on $D$. Then, for any $\eta > 0$, $\delta \in (0,1]$, \begin{equation}\label{eqn:pacbayes}
\E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_{h \sim \hat{\pi}}[R_n(D,h)] +
\frac{1}{\eta n}\Big(D(\hat{\pi}\|\pi) + \ln \frac{1}{\delta}\Big)
\end{equation} with probability at least $1-\delta$. Moreover, \begin{equation}\label{eqn:pacbayesexp}
\E_D \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_D\Big[ \E_{h \sim \hat{\pi}}[R_n(D,h)] +
\frac{1}{\eta n}D(\hat{\pi}\|\pi)\Big].

Here $D(\hat{\pi}\|\pi) = \int \hat{\pi}(h) \ln \frac{\hat{\pi}(h)}{\pi(h)} \mathrm{d} h$ denotes the Kullback-Leibler divergence of $\hat{\pi}$ from $\pi$.

To see that Lemma 6 generalises Lemma 4, suppose that $\hat{\pi}$ is a point-mass on $\hat{h}$. Then $D(\hat{\pi}\|\pi) = \ln (1/\pi(\hat{h}))$, and we recover Lemma 4 as a special case of \eqref{eqn:pacbayes}. An important difference with Lemma 4, however, is that Lemma 6 does not require $\mathcal{H}$ to be countable, and in fact in many PAC-Bayesian applications it is not.

Keep Reading…

We have seen how PAC-Bayesian inequalities naturally extend standard concentration inequalities based on the Cramér-Chernoff method by generalising the union bound to a continuous version. I’m afraid that’s all that will reasonably fit into a single blog post on my website. If you want more, simply continue with the full version, which covers several more issues, including:

  • How to relate $M_\eta(h)$ to $R(h)$ to obtain, for example, Hoeffding’s inequality.
  • How to optimize $\eta$, which comes “for free” in Lemma 4, but requires an additional union bound in the general PAC-Bayesian case of Lemma 6.
  • How the PAC-Bayesians choose their prior $\pi$ and posterior $\hat{\pi}$. (There’s an unexpected trick called localisation.)
  • Proofs
  • Literature references

  1. Called the empirical risk in statistics; hence the notation with `R’.