# PAC-Bayes Mini-tutorial: A Continuous Union Bound

When I first encountered PAC-Bayesian concentration inequalities they seemed to me to be rather disconnected from good old-fashioned results like Hoeffding’s and Bernstein’s inequalities. But, at least for one flavour of the PAC-Bayesian bounds, there is actually a very close relation, and the main innovation is a continuous version of the union bound, along with some ingenious applications. Here’s the gist of what’s going on, presented from a machine learning perspective.

NB On the website I can only provide a high-level sneak preview of the tutorial for brevity. For the full version, which is still quite short, but includes the proofs and fills in the missing details, see PAC-BayesMiniTutorial.pdf.

# The Cramér-Chernoff Method

I will start by outlining the Cramér-Chernoff method, from which Hoeffding’s and Bernstein’s inequalities and many others follow. This method is incredibly well explained in Appendix A of the textbook by Cesa-Bianchi and Lugosi , but I will have to change the presentation a little to easily connect with the PAC-Bayesian bounds later on.

Let $D =((X_1,Y_1),\ldots,(X_n,Y_n))$ be independent, identically distributed (i.i.d.) examples, and let $h$ be a hypothesis from a set of hypotheses $\mathcal{H}$, which gets loss $\ell(X_i,Y_i,h)$ on the $i$-th example. For example, we might think of the squared loss $\ell(X_i,Y_i,h) = (Y_i – h(X_i))^2$. We also define the empirical error ((Called the empirical risk in statistics; hence the notation with `R’.)) of $h$

\begin{equation*}
R_n(D,h) = \frac{1}{n} \sum_{i=1}^n \ell(X_i,Y_i,h),
\end{equation*}

and our goal is to prove that the empirical error is close to the generalisation error

\begin{equation*}
R(h) = \E[\ell(X,Y,h)]
\end{equation*}

with high probability. To do this, we define the function

\begin{equation*}
M_\eta(h)
= -\tfrac{1}{\eta} \ln \E\Big[e^{-\eta \ell(X,Y,h)}\Big]
\qquad \text{for $\eta > 0$,}
\end{equation*}

which will act as a surrogate for $R(h)$. Now the Cramér-Chernoff method tells us that:

Lemma 1 For any $\eta > 0$, $\delta \in (0,1]$,
\begin{equation}\label{eqn:chernoff}
M_\eta(h) \leq R_n(h,D) + \frac{1}{\eta n}\ln \frac{1}{\delta}
\end{equation}with probability at least $1-\delta$.

Many standard concentration inequalities can then be derived by first relating $M_\eta(h)$ to $R(h)$, and then optimizing $\eta$. This includes, for example, Hoeffding’s inequality and inequalities involving the variance like Bernstein’s inequality. See the full version of this post for details.

# The Union Bound

Now suppose we use an estimator $\hat{h} \equiv \hat{h}(D) \in \mathcal{H}$ to pick a hypothesis based on the data, for example using empirical risk minimization: $\hat{h} = \argmin_{h \in \mathcal{H}} R_n(D,h)$. To get a bound for $\hat{h}$ instead of a fixed $h$, we want \eqref{eqn:chernoff} to hold for all $h \in \mathcal{H}$ simultaneously. If $\mathcal{H}$ is countable, this can be done using the union bound, which leads to:

Lemma 4 Suppose $\mathcal{H}$ is countable. For $h \in \mathcal{H}$, let $\pi(h)$ be any numbers such that $\pi(h) \geq 0$ and $\sum_h \pi(h)= 1$. Then, for any $\eta > 0$, $\delta \in (0,1]$, \begin{equation}
M_\eta(\hat{h}) \leq R_n(D,\hat{h}) + \frac{1}{\eta n}\ln
\frac{1}{\pi(\hat{h})\delta}
\end{equation} with probability at least $1-\delta$.

In this context, the function $\pi$ is often referred to as a prior distribution, even though it need not have anything to do with prior beliefs.

Just like for Lemma 1, we can then again relate $M_\eta(h)$ to $R(h)$ to obtain a bound on the generalisation error. This shows, in a nutshell, how one can combine the Cramér-Chernoff method with the union bound to obtain concentration inequalities for estimators $\hat{h}$. The use of the union bound, however, is quite crude when there are multiple hypotheses in $\mathcal{H}$ with very similar losses, and the current proof breaks down completely if we want to extend it to continuous classes $\mathcal{H}$. This is where PAC-Bayesian bounds come to the rescue: in the next section I will explain the PAC-Bayesian generalisation of Lemma 4 to continuous hypothesis classes $\mathcal{H}$, which will require replacing $\hat{h}$ by a randomized estimator.

# PAC-Bayesian Concentration

Let $\hat{\pi} \equiv \hat{\pi}(D)$ be a distribution on $\mathcal{H}$ that depends on the data $D$, which we will interpret as a randomized estimator: instead of choosing $\hat{h}$ deterministically, we will sample $h \sim \hat{\pi}$ randomly. The distribution $\hat{\pi}$ is often called the PAC-Bayesian posterior distribution. Now the result that the PAC-Bayesians have, may be expressed as follows:

Lemma 6 Let $\pi$ be a (prior) distribution on $\mathcal{H}$ that does not depend on $D$, and let $\hat{\pi}$ be a randomized estimator that is allowed to depend on $D$. Then, for any $\eta > 0$, $\delta \in (0,1]$, \begin{equation}\label{eqn:pacbayes}
\E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_{h \sim \hat{\pi}}[R_n(D,h)] +
\frac{1}{\eta n}\Big(D(\hat{\pi}\|\pi) + \ln \frac{1}{\delta}\Big)
\end{equation} with probability at least $1-\delta$. Moreover, \begin{equation}\label{eqn:pacbayesexp}
\E_D \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_D\Big[ \E_{h \sim \hat{\pi}}[R_n(D,h)] +
\frac{1}{\eta n}D(\hat{\pi}\|\pi)\Big].
\end{equation}

Here $D(\hat{\pi}\|\pi) = \int \hat{\pi}(h) \ln \frac{\hat{\pi}(h)}{\pi(h)} \mathrm{d} h$ denotes the Kullback-Leibler divergence of $\hat{\pi}$ from $\pi$.

To see that Lemma 6 generalises Lemma 4, suppose that $\hat{\pi}$ is a point-mass on $\hat{h}$. Then $D(\hat{\pi}\|\pi) = \ln (1/\pi(\hat{h}))$, and we recover Lemma 4 as a special case of \eqref{eqn:pacbayes}. An important difference with Lemma 4, however, is that Lemma 6 does not require $\mathcal{H}$ to be countable, and in fact in many PAC-Bayesian applications it is not.

• How to relate $M_\eta(h)$ to $R(h)$ to obtain, for example, Hoeffding’s inequality.
• How to optimize $\eta$, which comes “for free” in Lemma 4, but requires an additional union bound in the general PAC-Bayesian case of Lemma 6.
• How the PAC-Bayesians choose their prior $\pi$ and posterior $\hat{\pi}$. (There’s an unexpected trick called localisation.)