When I first encountered PAC-Bayesian concentration inequalities they seemed to me to be rather disconnected from good old-fashioned results like Hoeffding’s and Bernstein’s inequalities. But, at least for one flavour of the PAC-Bayesian bounds, there is actually a very close relation, and the main innovation is a continuous version of the union bound, along with some ingenious applications. Here’s the gist of what’s going on, presented from a machine learning perspective.

# From Exp-concavity to Mixability

In sequential prediction (online learning) with expert advice the goal
is to predict a sequence of outcomes almost as well as the best
advisor from a pool of experts. The quality of predictions is measured
by a *loss function*, which is determined by the application one
has in mind. For most loss functions, the best performance that can be
guaranteed is to be within $O(\sqrt{T})$ of the best expert on $T$
outcomes, but for some special loss functions $O(1)$ overhead is
possible.

In the 1990’s, people familiar with the work of Vovk called these
special loss functions *mixable losses*, but nowadays the notion
of mixability appears to be mostly forgotten, and the geometric
concept of *exp-concavity* has taken its place. This raises the
question of how the two are related, which strangely does not appear
to be answered in very much detail in the literature. As I have been
studying mixability quite a bit in my recent work, I was wondering
about this, so here are some thoughts. **Update:** In particular, I will construct a parameterization of the squared loss in
which it is $1/2$-exp-concave instead of only $1/8$-exp-concave like
in its usual parameterization.

# Large deviations: Cramér vs Sanov

I have recently been reading up on two classical results from large deviation theory: the *Cramér-Chernoff theorem* and *Sanov’s theorem*. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent.

It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…

Continue reading