When I first encountered PAC-Bayesian concentration inequalities they seemed to me to be rather disconnected from good old-fashioned results like Hoeffding’s and Bernstein’s inequalities. But, at least for one flavour of the PAC-Bayesian bounds, there is actually a very close relation, and the main innovation is a continuous version of the union bound, along with some ingenious applications. Here’s the gist of what’s going on, presented from a machine learning perspective.
I have recently been reading up on two classical results from large deviation theory: the Cramér-Chernoff theorem and Sanov’s theorem. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent.
It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…Continue reading