Large deviations: Cramér vs Sanov

Posted on August 03, 2012

5 minute read

I have recently been reading up on two classical results from large deviation theory: the Cramér-Chernoff theorem and Sanov’s theorem. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent.

It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…

The Cramér-Chernoff Theorem

Let \(X, X_1, ..., X_n\) be independent, identically distributed random variables, and \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min}\) let \(\mu(\lambda) = \log \E[e^{\lambda X}]\) be their cumulant generating function. Then many standard concentration inequalities (Hoeffding, Bernstein, Bennett) can be derived [1, Appendix A.1] from a single basic result:

\[\Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) \leq e^{-n\{\lambda a - \mu(\lambda)\}}\qquad(\lambda > 0).\]

This result can easily be proved using Markov’s inequality and is non-trivial as long as \(a > \E[X]\). Its ubiquity is explained by the Cramér-Chernoff theorem [2], which states that it asymptotically achieves the best possible exponent if we optimize our choice of \(\lambda\):

\[\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) = \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\]

Sanov’s Theorem

The empirical distribution \(P_n\) of \(X_1, \ldots, X_n\) gives probability \(P_n(A) = \frac{1}{n}\sum_{i=1}^n {\bf 1}_{\{X_i \in A\}}\) to any event \(A\), which is the fraction of variables taking their value in \(A\). If the distribution of the variables is \(P^*\), then asymptotically we would expect \(P_n\) to be close to \(P^*\). So then if \(\mathcal{P}\) is a set of distributions that is far away from \(P^*\) in some sense, the probability that \(P_n \in \mathcal{P}\) should be small. This intuition is quantified by Sanov’s theorem [3], [4].

To avoid measure-theoretic complications, let us assume that the variables \(X_1, \ldots, X_n\) are discrete, with a finite number of possible values. Suppose \(\mathcal{P}\) is a convex set of distributions and assume that the information projection

\[Q^* = \argmin_{P \in \mathcal{P}}\ D(P\|P^*)\]

of \(P^*\) on \(\mathcal{P}\) exists, where \(D(P\|P^*) = \sum_x P(x) \log \frac{P(x)}{P^*(x)}\) is the Kullback-Leibler divergence. Then \(D(Q^*\|P^*)\) provides an information-theoretic measure of the “distance” of \(\mathcal{P}\) from \(P^*\), and indeed [3]

\[\Pr\Big(P_n \in \mathcal{P}\Big) \leq e^{-nD(Q^*\|P^*)}.\]

Moreover, Sanov’s theorem tells us that this bound is asymptotically tight in the exponent:

\[\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(P_n \in \mathcal{P}\Big) = D(Q^*\|P^*).\]

Csiszár [3, p. 790] has an elegant information-theoretic proof of the upper bound. He also works out sufficient measure-theoretic conditions for the theorem to hold in continuous spaces, which are quite clean, but still require considerable care to verify.

One may extend Sanov’s theorem to non-convex sets \(\mathcal{P}\) by a union bound argument. For example, Cover and Thomas [4] take a union bound over all possible values for \(P_n\), which they call types. By discretization arguments one may further extend the theorem to infinite spaces [5], but then things get a bit too asymptotic for my taste.

From Sanov to Cramér-Chernoff

It turns out that the Cramér-Chernoff theorem can be obtained from Sanov’s theorem. (This is called contraction.) The trick is to observe that

\[\E_{X \sim P_n}[X] = \frac{1}{n} \sum_{i=1}^n X_i,\]

so if we define the convex set \(\mathcal{P} = \{P \mid \E_{X \sim P}[X] \geq a\}\), then

\[\Pr\Big(P_n \in \mathcal{P}\Big) = \Pr\Big(\frac{1}{n} \sum_{i=1}^n X_i \geq a\Big).\]

It remains to evaluate \(D(Q^* \| P^*)\) in this case, which can be done as follows:

Introduce a Lagrange multiplier to obtain:
\[\min_{P \in \mathcal{P}} D(P\|P^*) = \min_{\{P \mid \E_{X \sim P}[X] \geq a\}} D(P\|P^*) = \min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]
Use Sion’s minimax theorem to swap the order of the min and the sup:
\[\min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} = \sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]
Recognize the following convex duality for Kullback-Leibler divergence:
\[\sup_P\ \big(\E_{X \sim P}[\lambda X] - D(P||P^*\big) = \mu(\lambda),\]

where \(\mu(\lambda) = \log \E_{P^*}[e^{\lambda X}]\) is the cumulant generating function defined above. We get:

\[\begin{split}\sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} &= \sup_{\lambda > 0}\Big\{\lambda a -\sup_P\Big(\E_P[\lambda X]- D(P\|P^*)\Big)\Big\}\\&= \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\end{split}\]

Chaining everything together we exactly recover the Cramér-Chernoff theorem, and we see that the upper bounds have exactly the same constants.

From Cramér-Chernoff to Sanov

One may also view things the other way around. The Cramér-Chernoff theorem bounds the probability that the value of the empirical mean \(\frac{1}{n} \sum_{i=1}^n X_i\) lies in the set \(A = [a,\infty)\). As discussed by [2, pp. 208-210], both the notion of empirical mean and the set \(A\) can be generalized. In particular, one may regard the empirical distribution \(P_n\) as the mean of \(n\) point-masses (i.e. Dirac measures) on the points \(X_1, \ldots, X_n\). Van der Vaart then presents Sanov’s theorem as just one instance of such generalized Cramér-Chernoff theorems.

Conclusion

We have seen the close similarities between the Cramér-Chernoff theorem and Sanov’s theorem. For me Sanov’s theorem seems easier to interpret, but in continuous spaces one has to deal with the more complicated measure-theoretic conditions of Csiszár [3]. For technical reasons it may therefore be preferable to use the Cramér-Chernoff result.

Last Words

It turns out that the upper bound in the Cramér-Chernoff theorem does leave some slack in the order of \(1/\sqrt{n}\), which is negligible compared to the term in the exponent.

References

N. Cesa-Bianchi and G. Lugosi, Prediction, learning, and games. Cambridge University Press, 2006.
A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 1998.
I. Csiszár, “Sanov Property, Generalized I-Projection and a Conditional Limit Theorem,” The Annals of Probability, vol. 12, no. 3, pp. 768–793, 1984.
T. M. Cover and J. A. Thomas, Elements of Information Theory, Second Edition. John Wiley & Sons, 2006.
F. den Hollander, “Large deviations,” in Large Deviations, vol. 14, American Mathematical Society, 2000.

Large deviations: Cramér vs Sanov

The Cramér-Chernoff Theorem

Sanov’s Theorem

From Sanov to Cramér-Chernoff

From Cramér-Chernoff to Sanov

Conclusion

Last Words

References

Other posts you might like:

Why FTRL Is Better than Online Mirror Descent

Online Learning: How to Filter Out Outliers

PAC-Bayes Mini-tutorial: A Continuous Union Bound