PACBayes Minitutorial: A Continuous Union Bound
When I first encountered PACBayesian concentration inequalities they seemed to me to be rather disconnected from good oldfashioned results like Hoeffding’s and Bernstein’s inequalities. But, at least for one flavour of the PACBayesian bounds, there is actually a very close relation, and the main innovation is a continuous version of the union bound, along with some ingenious applications. Here’s the gist of what’s going on, presented from a machine learning perspective.
NB On the website I can only provide a highlevel sneak preview of the tutorial for brevity. For the full version, which is still quite short, but includes the proofs and fills in the missing details, see PACBayesMiniTutorial.pdf. \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min}\)
The CramérChernoff Method
I will start by outlining the CramérChernoff method, from which Hoeffding’s and Bernstein’s inequalities and many others follow. This method is incredibly well explained in Appendix A of the textbook by CesaBianchi and Lugosi [3], but I will have to change the presentation a little to easily connect with the PACBayesian bounds later on.
Let \(D =((X_1,Y_1),\ldots,(X_n,Y_n))\) be independent, identically distributed (i.i.d.) examples, and let \(h\) be a hypothesis from a set of hypotheses \(\mathcal{H}\), which gets loss \(\ell(X_i,Y_i,h)\) on the \(i\)th example. For example, we might think of the squared loss \(\ell(X_i,Y_i,h) = (Y_i  h(X_i))^2\). We also define the empirical error ^{1} of \(h\)
\[R_n(D,h) = \frac{1}{n} \sum_{i=1}^n \ell(X_i,Y_i,h),\]and our goal is to prove that the empirical error is close to the generalisation error
\[R(h) = \E[\ell(X,Y,h)]\]with high probability. To do this, we define the function
\[M_\eta(h) = \tfrac{1}{\eta} \ln \E\Big[e^{\eta \ell(X,Y,h)}\Big] \qquad \text{for $\eta > 0$,}\]which will act as a surrogate for \(R(h)\). Now the CramérChernoff method tells us that:
Lemma 1. For any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation}\label{eqn:chernoff} M_\eta(h) \leq R_n(h,D) + \frac{1}{\eta n}\ln \frac{1}{\delta} \end{equation}with probability at least \(1\delta\).
Many standard concentration inequalities can then be derived by first relating \(M_\eta(h)\) to \(R(h)\), and then optimizing \(\eta\). This includes, for example, Hoeffding’s inequality and inequalities involving the variance like Bernstein’s inequality. See the full version of this post for details.
The Union Bound
Now suppose we use an estimator \(\hat{h} \equiv \hat{h}(D) \in \mathcal{H}\) to pick a hypothesis based on the data, for example using empirical risk minimization: \(\hat{h} = \argmin_{h \in \mathcal{H}} R_n(D,h)\). To get a bound for \(\hat{h}\) instead of a fixed \(h\), we want \eqref{eqn:chernoff} to hold for all \(h \in \mathcal{H}\) simultaneously. If \(\mathcal{H}\) is countable, this can be done using the union bound, which leads to:
Lemma 4. Suppose \(\mathcal{H}\) is countable. For \(h \in \mathcal{H}\), let \(\pi(h)\) be any numbers such that \(\pi(h) \geq 0\) and \(\sum_h \pi(h)= 1\). Then, for any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation} M_\eta(\hat{h}) \leq R_n(D,\hat{h}) + \frac{1}{\eta n}\ln \frac{1}{\pi(\hat{h})\delta} \end{equation} with probability at least \(1\delta\).
In this context, the function \(\pi\) is often referred to as a prior distribution, even though it need not have anything to do with prior beliefs.
Just like for Lemma 1, we can then again relate \(M_\eta(h)\) to \(R(h)\) to obtain a bound on the generalisation error. This shows, in a nutshell, how one can combine the CramérChernoff method with the union bound to obtain concentration inequalities for estimators \(\hat{h}\). The use of the union bound, however, is quite crude when there are multiple hypotheses in \(\mathcal{H}\) with very similar losses, and the current proof breaks down completely if we want to extend it to continuous classes \(\mathcal{H}\). This is where PACBayesian bounds come to the rescue: in the next section I will explain the PACBayesian generalisation of Lemma 4 to continuous hypothesis classes \(\mathcal{H}\), which will require replacing \(\hat{h}\) by a randomized estimator.
PACBayesian Concentration
Let \(\hat{\pi} \equiv \hat{\pi}(D)\) be a distribution on \(\mathcal{H}\) that depends on the data \(D\), which we will interpret as a randomized estimator: instead of choosing \(\hat{h}\) deterministically, we will sample \(h \sim \hat{\pi}\) randomly. The distribution \(\hat{\pi}\) is often called the PACBayesian posterior distribution. Now the result that the PACBayesians have, may be expressed as follows:
Lemma 6. Let \(\pi\) be a (prior) distribution on \(\mathcal{H}\) that does not depend on \(D\), and let \(\hat{\pi}\) be a randomized estimator that is allowed to depend on \(D\). Then, for any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation}\label{eqn:pacbayes} \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_{h \sim \hat{\pi}}[R_n(D,h)] + \frac{1}{\eta n}\Big(D(\hat{\pi}\pi) + \ln \frac{1}{\delta}\Big) \end{equation} with probability at least \(1\delta\). Moreover, \begin{equation}\label{eqn:pacbayesexp} \E_D \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_D\Big[ \E_{h \sim \hat{\pi}}[R_n(D,h)] + \frac{1}{\eta n}D(\hat{\pi}\pi)\Big]. \end{equation}
Here \(D(\hat{\pi}\\pi) = \int \hat{\pi}(h) \ln \frac{\hat{\pi}(h)}{\pi(h)} \mathrm{d} h\) denotes the KullbackLeibler divergence of \(\hat{\pi}\) from \(\pi\).
To see that Lemma 6 generalises Lemma 4, suppose that \(\hat{\pi}\) is a pointmass on \(\hat{h}\). Then \(D(\hat{\pi}\\pi) = \ln (1/\pi(\hat{h}))\), and we recover Lemma 4 as a special case of \eqref{eqn:pacbayes}. An important difference with Lemma 4, however, is that Lemma 6 does not require \(\mathcal{H}\) to be countable, and in fact in many PACBayesian applications it is not.
Keep Reading…
We have seen how PACBayesian inequalities naturally extend standard concentration inequalities based on the CramérChernoff method by generalising the union bound to a continuous version. I’m afraid that’s all that will reasonably fit into a single blog post on my website. If you want more, simply continue with the full version, which covers several more issues, including:
 How to relate \(M_\eta(h)\) to \(R(h)\) to obtain, for example, Hoeffding’s inequality.
 How to optimize \(\eta\), which comes “for free” in Lemma 4, but requires an additional union bound in the general PACBayesian case of Lemma 6.
 How the PACBayesians choose their prior \(\pi\) and posterior \(\hat{\pi}\). (There’s an unexpected trick called localisation.)
 Proofs
 Literature references

Called the empirical risk in statistics; hence the notation with `R’. ↩