NB: On the website I can only provide a brief, high-level preview of the tutorial. For the full version, which is still quite short but includes the proofs and fills in the missing details, see PAC-BayesMiniTutorial.pdf. \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min}\)
I will start by outlining the Cramér-Chernoff method, from which Hoeffding’s and Bernstein’s inequalities and many others follow. This method is incredibly well explained in Appendix A of the textbook by Cesa-Bianchi and Lugosi [1], but I will have to change the presentation a little to easily connect with the PAC-Bayesian bounds later on.
Let \(D =((X_1,Y_1),\ldots,(X_n,Y_n))\) be independent, identically distributed (i.i.d.) examples, and let \(h\) be a hypothesis from a set of hypotheses \(\mathcal{H}\), which gets loss \(\ell(X_i,Y_i,h)\) on the \(i\)-th example. For example, we might think of the squared loss \(\ell(X_i,Y_i,h) = (Y_i - h(X_i))^2\). We also define the empirical error\(^1\) of \(h\)
\[R_n(D,h) = \frac{1}{n} \sum_{i=1}^n \ell(X_i,Y_i,h),\]and our goal is to prove that the empirical error is close to the expected error
\[R(h) = \E[\ell(X,Y,h)]\]with high probability. To do this, we define the function
\[M_\eta(h) = -\tfrac{1}{\eta} \ln \E\Big[e^{-\eta \ell(X,Y,h)}\Big] \qquad \text{for $\eta > 0$,}\]which will act as a surrogate for \(R(h)\). Now the Cramér-Chernoff method tells us that:
Lemma 1. For any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation}\label{eqn:chernoff} M_\eta(h) \leq R_n(D,h) + \frac{1}{\eta n}\ln \frac{1}{\delta} \end{equation}with probability at least \(1-\delta\).
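Lemma 1 is easy to check numerically. The following sketch (all parameter values are illustrative choices, not from the tutorial) uses a Bernoulli 0/1 loss, for which \(M_\eta(h)\) has a closed form, and verifies over many simulated datasets that the bound holds with frequency at least \(1-\delta\):

```python
import numpy as np

rng = np.random.default_rng(0)
p, eta, n, delta = 0.3, 1.0, 100, 0.05   # illustrative parameter choices
trials = 10_000

# Closed-form surrogate for a Bernoulli(p) 0/1 loss:
# M_eta = -(1/eta) * ln E[exp(-eta * loss)].
M_eta = -np.log((1 - p) + p * np.exp(-eta)) / eta

# Simulate many datasets and compute the empirical error on each.
losses = rng.binomial(1, p, size=(trials, n))
R_n = losses.mean(axis=1)

# Lemma 1: M_eta <= R_n + ln(1/delta) / (eta * n), w.p. at least 1 - delta.
bound = R_n + np.log(1 / delta) / (eta * n)
coverage = np.mean(M_eta <= bound)
print(coverage)  # empirically well above 1 - delta = 0.95
```

In practice the bound is conservative, so the observed coverage is typically much higher than \(1-\delta\).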
Many standard concentration inequalities can then be derived by first relating \(M_\eta(h)\) to \(R(h)\), and then optimizing \(\eta\). This includes, for example, Hoeffding’s inequality and inequalities involving the variance like Bernstein’s inequality. See the full version of this post for details.
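As a concrete instance of this recipe (a standard derivation, assuming the loss takes values in \([0,1]\)): Hoeffding's lemma gives \(\E\big[e^{-\eta \ell(X,Y,h)}\big] \leq e^{-\eta R(h) + \eta^2/8}\), so that \(M_\eta(h) \geq R(h) - \eta/8\). Plugging this into Lemma 1 yields \[R(h) \leq R_n(D,h) + \frac{\eta}{8} + \frac{1}{\eta n}\ln \frac{1}{\delta}\]with probability at least \(1-\delta\), and choosing \(\eta = \sqrt{8\ln(1/\delta)/n}\) to minimize the right-hand side gives \[R(h) \leq R_n(D,h) + \sqrt{\frac{\ln(1/\delta)}{2n}},\]which is Hoeffding's inequality.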
Now suppose we use an estimator \(\hat{h} \equiv \hat{h}(D) \in \mathcal{H}\) to pick a hypothesis based on the data, for example using empirical risk minimization: \(\hat{h} = \argmin_{h \in \mathcal{H}} R_n(D,h)\). To get a bound for \(\hat{h}\) instead of a fixed \(h\), we want \eqref{eqn:chernoff} to hold for all \(h \in \mathcal{H}\) simultaneously. If \(\mathcal{H}\) is countable, this can be done using the union bound, which leads to:
Lemma 4. Suppose \(\mathcal{H}\) is countable. For \(h \in \mathcal{H}\), let \(\pi(h)\) be any numbers such that \(\pi(h) \geq 0\) and \(\sum_h \pi(h)= 1\). Then, for any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation} M_\eta(\hat{h}) \leq R_n(D,\hat{h}) + \frac{1}{\eta n}\ln \frac{1}{\pi(\hat{h})\delta} \end{equation} with probability at least \(1-\delta\).
In this context, the function \(\pi\) is often referred to as a prior distribution, even though it need not have anything to do with prior beliefs.
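To get a feel for the \(\frac{1}{\eta n}\ln(1/\pi(\hat{h}))\) penalty in Lemma 4, the following sketch compares a uniform prior over \(K\) hypotheses with a geometric prior over a countably infinite class (all numbers here are illustrative assumptions):

```python
import numpy as np

eta, n = 1.0, 1000   # illustrative choices
K = 100              # look at the first K hypotheses

# Uniform prior over K hypotheses: every h pays the same penalty ln(K).
uniform_penalty = np.log(K) / (eta * n)

def geo_penalty(k):
    # Geometric prior pi(h_k) = 2^{-(k+1)} over k = 0, 1, 2, ...
    # (sums to 1), so the penalty ln(1/pi(h_k)) grows linearly in k.
    return np.log(2.0 ** (k + 1)) / (eta * n)

print(uniform_penalty)    # ~0.0046
print(geo_penalty(0))     # ~0.0007: cheaper than uniform for early h
print(geo_penalty(K - 1)) # ~0.0693: much more expensive for late h
```

The geometric prior thus encodes a preference: hypotheses we consider more plausible up front get a smaller complexity penalty, at the cost of a larger penalty elsewhere.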
Just like for Lemma 1, we can then again relate \(M_\eta(h)\) to \(R(h)\) to obtain a bound on the expected error. This shows, in a nutshell, how one can combine the Cramér-Chernoff method with the union bound to obtain concentration inequalities for estimators \(\hat{h}\). The use of the union bound, however, is quite crude when there are multiple hypotheses in \(\mathcal{H}\) with very similar losses, and the current proof breaks down completely if we want to extend it to continuous classes \(\mathcal{H}\). This is where PAC-Bayesian bounds come to the rescue: in the next section I will explain the PAC-Bayesian generalisation of Lemma 4 to continuous hypothesis classes \(\mathcal{H}\), which will require replacing \(\hat{h}\) by a randomized estimator.
Let \(\hat{\pi} \equiv \hat{\pi}(D)\) be a distribution on \(\mathcal{H}\) that depends on the data \(D\), which we will interpret as a randomized estimator: instead of choosing \(\hat{h}\) deterministically, we will sample \(h \sim \hat{\pi}\) randomly. The distribution \(\hat{\pi}\) is often called the PAC-Bayesian posterior distribution. The central PAC-Bayesian result may then be expressed as follows:
Lemma 6. Let \(\pi\) be a (prior) distribution on \(\mathcal{H}\) that does not depend on \(D\), and let \(\hat{\pi}\) be a randomized estimator that is allowed to depend on \(D\). Then, for any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation}\label{eqn:pacbayes} \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_{h \sim \hat{\pi}}[R_n(D,h)] + \frac{1}{\eta n}\Big(D(\hat{\pi}\|\pi) + \ln \frac{1}{\delta}\Big) \end{equation} with probability at least \(1-\delta\). Moreover, \begin{equation}\label{eqn:pacbayesexp} \E_D \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_D\Big[ \E_{h \sim \hat{\pi}}[R_n(D,h)] + \frac{1}{\eta n}D(\hat{\pi}\|\pi)\Big]. \end{equation}
Here \(D(\hat{\pi}\|\pi) = \int \hat{\pi}(h) \ln \frac{\hat{\pi}(h)}{\pi(h)} \mathrm{d} h\) denotes the Kullback-Leibler divergence of \(\hat{\pi}\) from \(\pi\).
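As a quick numerical sanity check of this definition, here is a sketch computing \(D(\hat{\pi}\|\pi)\) for discrete distributions (the specific prior and posteriors are made up for illustration):

```python
import numpy as np

def kl_divergence(post, prior):
    """D(post || prior) for discrete distributions, with 0 * ln 0 = 0."""
    post = np.asarray(post, dtype=float)
    prior = np.asarray(prior, dtype=float)
    support = post > 0
    return float(np.sum(post[support] * np.log(post[support] / prior[support])))

prior = np.array([0.5, 0.25, 0.25])  # illustrative prior

# Point mass on the second hypothesis: divergence is ln(1 / prior[1]) = ln 4.
print(kl_divergence([0.0, 1.0, 0.0], prior))

# A posterior spread over two hypotheses pays less: ln 2 instead of ln 4.
print(kl_divergence([0.0, 0.5, 0.5], prior))
```

Note that spreading the posterior over several hypotheses can only reduce the divergence, which is exactly the refinement over the union bound.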
To see that Lemma 6 generalises Lemma 4, suppose that \(\hat{\pi}\) is a point-mass on \(\hat{h}\). Then \(D(\hat{\pi}\|\pi) = \ln (1/\pi(\hat{h}))\), and we recover Lemma 4 as a special case of \eqref{eqn:pacbayes}. An important difference with Lemma 4, however, is that Lemma 6 does not require \(\mathcal{H}\) to be countable, and in fact in many PAC-Bayesian applications it is not.
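To see Lemma 6 in action, the following sketch simulates a small finite class for concreteness (the hypotheses, their Bernoulli loss rates, and all parameter values are illustrative assumptions) with a Gibbs posterior, and checks empirically that inequality \eqref{eqn:pacbayes} is violated with frequency at most \(\delta\):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, n, delta, trials = 1.0, 100, 0.05, 2000   # illustrative choices
p = np.array([0.2, 0.3, 0.4, 0.5])             # per-hypothesis Bernoulli loss rates
prior = np.full(len(p), 1.0 / len(p))          # uniform prior

# Closed-form M_eta(h) for Bernoulli(p) 0/1 losses.
M = -np.log((1 - p) + p * np.exp(-eta)) / eta

violations = 0
for _ in range(trials):
    losses = rng.binomial(1, p, size=(n, len(p)))  # column k: losses of h_k
    R_n = losses.mean(axis=0)
    # Gibbs posterior: pi_hat(h) proportional to pi(h) * exp(-eta * n * R_n(h)).
    weights = prior * np.exp(-eta * n * R_n)
    post = weights / weights.sum()
    kl = np.sum(post * np.log(post / prior))
    lhs = np.sum(post * M)
    rhs = np.sum(post * R_n) + (kl + np.log(1 / delta)) / (eta * n)
    violations += lhs > rhs

print(violations / trials)  # should not exceed delta = 0.05
```

The Gibbs posterior used here is a common choice because it minimizes the right-hand side of \eqref{eqn:pacbayes} over all posteriors, but the lemma holds for any data-dependent \(\hat{\pi}\).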
We have seen how PAC-Bayesian inequalities naturally extend standard concentration inequalities based on the Cramér-Chernoff method by generalising the union bound to a continuous version. I’m afraid that’s all that will reasonably fit into a single blog post on my website. If you want more, simply continue with the full version, which covers several more issues, including:
Called the empirical risk in statistics; hence the notation with ‘R’.