But if you do not enforce a bound on the gradient lengths, then you make yourself vulnerable to outliers: a single extreme data point (e.g. with a large gradient) can completely break the online learner. How do we defend against such outliers? In this post, I describe a simple (but optimal) filtering strategy that can be applied to any existing online learning algorithm. This is based on our recent COLT paper [4] with Sarah Sachs, Wouter Koolen and Wojciech Kotłowski. There are also recorded videos from COLT if you prefer.
Before getting into the technical side of things, it is helpful to consider several reasons why outliers might occur:
To deal with all such cases simultaneously, we want to avoid making any assumptions about the nature of the outliers. This motivates the game-theoretic approach described below. \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\argmin}{arg\,min} \DeclareMathOperator{\diameter}{diameter} \DeclareMathOperator{\risk}{Risk} \newcommand{\u}{\boldsymbol u} \newcommand{\w}{\boldsymbol w} \newcommand{\grad}{\boldsymbol g} \newcommand{\domain}{\mathcal{W}} \newcommand{\reals}{\mathbb{R}} \newcommand{\cS}{\mathcal{S}} \newcommand{\gradlist}{\mathcal{L}}\)
Let’s start by fixing the setting, which will be a variant of the standard online convex optimization framework: Each round \(t=1,\ldots,T\), the learner predicts a point \(\w_t\) from a convex domain \(\domain \subset \reals^d\) of diameter at most \(\diameter(\domain) \leq D\). The environment then selects a convex loss function \(f_t : \domain \to \reals\) and the learner incurs loss \(f_t(\w_t)\). Most online learning algorithms update their predictions efficiently, using only the gradient \(\grad_t = \nabla f_t(\w_t)\), and the standard objective is for the learner to control the regret
\[R_T(\u) = \sum_{t=1}^T \big(f_t(\w_t) - f_t(\u)\big) \qquad \textrm{(standard regret)}\]for all \(\u \in \domain\) simultaneously.
Suppose now that only a subset of rounds \(\cS \subseteq [T] := \{1,\ldots,T\}\) consist of reliable inliers and the other rounds \([T] \setminus \cS\) are all outliers that our learning algorithm should ignore. Then the objective will be to control the robust regret:
\[R_T(\u,\cS) = \sum_{t \in \cS} \big(f_t(\w_t) - f_t(\u)\big) \qquad \textrm{(robust regret)}.\]Note the difference, which is that we are measuring performance only on inlier rounds in \(\cS\).
The robust regret setup creates the following challenges:
Since we do not assume that outliers behave reasonably in any way, it follows that bounds on the robust regret can only depend on the scale \(G(\cS) = \max_{t \in \cS} \|\grad_t\|\) of the inlier gradients, but not on the lengths of the outliers!
So what can we do? We will take a very general approach, in which we modify any existing online learner ALG by filtering out some rounds with extreme gradients. By filtering a round, I mean that we pretend to ALG that the round did not happen, so we do not feed it any gradient and ALG’s prediction remains unchanged.
We will assume that there are at most \(k\) outliers, so
\[T - |\cS| \leq k.\]Then the filtering strategy, which we call top-k filter, works as follows:
Note the factor 2 in front of \(\min \gradlist_t\) in the filtering threshold!
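To make the wrapper idea concrete, here is a minimal sketch in Python. The details are my own reading of the threshold above, not the paper's exact pseudocode: I assume \(\gradlist_t\) holds the \(k+1\) largest gradient lengths observed so far, and that round \(t\) is filtered whenever \(\|\grad_t\| > 2 \min \gradlist_t\).

```python
import heapq

class TopKFilter:
    """Wraps any online learner ALG, filtering rounds with extreme gradients.

    Hypothetical sketch: assumes a round is filtered when the gradient length
    exceeds twice the smallest entry of the list of the k+1 largest gradient
    lengths seen so far (the factor 2 from the text)."""

    def __init__(self, alg, k):
        self.alg = alg   # any online learner exposing predict() and update(g)
        self.k = k       # assumed upper bound on the number of outliers
        self.top = []    # min-heap of the k+1 largest gradient lengths so far

    def predict(self):
        return self.alg.predict()

    def update(self, g):
        length = sum(x * x for x in g) ** 0.5
        # maintain the k+1 largest gradient lengths observed so far
        if len(self.top) < self.k + 1:
            heapq.heappush(self.top, length)
        elif length > self.top[0]:
            heapq.heapreplace(self.top, length)
        # filter: pretend the round did not happen if the gradient is extreme
        if length > 2 * self.top[0]:
            return
        self.alg.update(g)
```

Filtered rounds leave ALG untouched, so its prediction simply carries over unchanged to the next round, exactly as described above.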
One thing to like about top-k filter is its simplicity, but of course true happiness can only be found if its performance is actually good. In the paper, we first show the following general guarantee for top-k filtering when the losses are linear, i.e. \(f_t(\w) = \w^\top \grad_t\).
Theorem 1 (Linear Losses). Suppose losses are linear, and ALG achieves regret bound \(B_T(G)\) when running on at most \(T\) rounds with gradients of length at most \(G\). Then ALG + top-k filter guarantees that the robust regret does not exceed
\[R_T(\u,\cS) \leq B_T(2 G(\cS)) + 4 D G(\cS)(k+1) \qquad \textrm{for any $\cS: T - |\cS| \leq k$.}\]See below for the main ideas behind the proof.
The first term, \(B_T(2 G(\cS))\), comes from feeding ALG gradients of length at most \(2 G(\cS)\), which it turns out is guaranteed by top-k filter. The second term of order \(O(G(\cS) k)\) is the price we pay for achieving robustness. Note that it scales with the size of the inlier gradients \(G(\cS)\) only, and does not depend on the outliers at all!
At first sight, it might appear restrictive that the bound only applies to linear losses, but we can easily get around this using a standard reduction from general convex losses to the linear case via the inequality \(f_t(\w_t) - f_t(\u) \leq (\w_t - \u)^\top \grad_t\). There exists a similar reduction for so-called strongly convex losses. By instantiating ALG to be the standard online gradient descent algorithm, this leads to the following guarantees on the robust regret:
| Losses | Minimax Standard Regret | Minimax Robust Regret |
| --- | --- | --- |
| General Convex Losses | \(O(\sqrt{T})\) | \(O(\sqrt{T} + k)\) |
| Strongly Convex Losses | \(O(\ln(T))\) | \(O(\ln(T) + k)\) |
These rates turn out to be optimal, because we prove matching lower bounds both for general convex losses and for strongly convex losses. In fact, for general convex losses the lower bound construction uses independent, identically distributed (i.i.d.) losses, so even if we assume that the losses are coming from a nice fixed distribution, then our upper bound is still optimal. We see that the optimal price of robustness is an additive \(O(k)\) in quite some generality, and top-k filter achieves it.
So are we happy? Well, the pursuit of happiness is a fool’s errand. We should rather aim to live a fulfilling life, but certainly there is nothing more fulfilling than optimality!
The main ideas behind top-k filtering and Theorem 1 are as follows:
1. Top-k filter ensures that we never pass ALG any gradients whose length exceeds \(2 G(\cS)\).
To see this, note that:
2. The overhead for filtering is \(O(k)\).
We can account for incorrectly filtered rounds \(t \in \cS\) as follows:
There is one more result from the paper that I would like to highlight, which is that the robust regret can be used for robust learning in the Huber ε-contamination setting via a robustified variant of the usual online-to-batch conversion trick.
In the Huber ε-contamination model, our losses are of the form \(f_t(\w) = f(\w,\xi_t)\), where the underlying data \(\xi_1,\ldots,\xi_T\) are sampled i.i.d. from a mixture distribution \(P_\epsilon\):
\[P_\epsilon = (1-\epsilon) P + \epsilon Q.\]The interpretation is that observations from the distribution of interest \(P\) are corrupted by observations from some unknown outlier distribution \(Q\) with probability \(\epsilon\). This is a batch-learning setting, in which the goal is to output a single set of parameters \(\bar \w_T\) that achieve small risk under \(P\):
\[\risk_P(\w) = \E_{\xi \sim P}[f(\w,\xi)].\]When \(\epsilon = 0\) (i.e., there are no outliers), the standard online-to-batch conversion approach [5] allows us to average the predictions \(\w_1,\ldots,\w_T\) of any online learning algorithm according to
\[\bar \w_T = \frac{1}{T} \sum_{t=1}^T \w_t\]with the guarantee that
\[\E_P[\risk_P(\bar \w_T) - \risk_P(\u_P)] \leq \frac{\E_P[R_T(\u_P)]}{T},\]where \(\u_P = \argmin_{\u \in \domain} \risk_P(\u)\).
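As a concrete illustration of this conversion, here is a minimal sketch with projected online gradient descent as the online learner; the quadratic example, ball-shaped domain and constant step size are illustrative choices of my own, not from the paper.

```python
import math

def ogd_online_to_batch(grad, T, D=1.0, eta=0.1, d=2):
    """Projected online gradient descent plus the standard online-to-batch
    conversion: return the average of the iterates w_1, ..., w_T.

    `grad(w, t)` should return the gradient of f_t at w; the domain is taken
    to be the Euclidean ball of diameter D (an illustrative assumption)."""
    w = [0.0] * d
    avg = [0.0] * d
    for t in range(1, T + 1):
        for i in range(d):
            avg[i] += w[i] / T  # accumulate the running average of predictions
        g = grad(w, t)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        # project back onto the ball of radius D/2 around the origin
        norm = math.sqrt(sum(x * x for x in w))
        if norm > D / 2:
            w = [x * (D / 2) / norm for x in w]
    return avg
```

For i.i.d. losses such as \(f_t(\w) = \|\w - \xi_t\|^2\), the averaged iterate approximates the risk minimizer, in line with the guarantee above.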
It turns out that this can be generalized to the case \(\epsilon > 0\), in which there are outliers, if we replace the standard regret by the robust regret:
\[\E_{P_\epsilon}[\risk_P(\bar \w_T) - \risk_P(\u_P)] \leq \frac{\E_{P_\epsilon}[R_T(\u_P,\cS^*)]}{(1-\epsilon)T},\]where the inliers \(\cS^*\) are those rounds in which we receive a sample from the distribution of interest \(P\). Notice that now the expectations are under the mixture distribution \(P_\epsilon\), but the risk is measured with respect to \(P\)! If the losses or gradients of the inliers are bounded \(P\)-almost surely, then an analogous result also holds with high probability.
For instance, we show the following corollary, which achieves rates for Huber ε-contamination that are optimal under its assumptions:
Corollary. Suppose \(\|\nabla f(\w,\xi)\| \leq G\) when \(\xi \sim P\) is an inlier. Then online gradient descent with top-k filtering achieves
\[\risk_P(\bar \w_T) - \risk_P(\u_P) = O\Big(DG \epsilon + DG \sqrt{\frac{\ln(1/\delta)}{T}}\Big)\]with \(P_\epsilon\)-probability at least \(1-\delta\), for \(k\) tuned suitably depending on \(\epsilon,T\) and \(\delta \in (0,1]\).
In this post, I have described a filtering approach that provides a way to robustify any online learning algorithm ALG by filtering out large gradients using top-k filter. This approach turns out to be optimal in the new robust regret setting, which further allows a type of robustified online-to-batch conversion for the Huber ε-contamination model.
NB For brevity, I can only provide a high-level sneak preview of the tutorial on the website. For the full version, which is still quite short, but includes the proofs and fills in the missing details, see PAC-BayesMiniTutorial.pdf. \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min}\)
I will start by outlining the Cramér-Chernoff method, from which Hoeffding’s and Bernstein’s inequalities and many others follow. This method is incredibly well explained in Appendix A of the textbook by Cesa-Bianchi and Lugosi [1], but I will have to change the presentation a little to easily connect with the PAC-Bayesian bounds later on.
Let \(D =((X_1,Y_1),\ldots,(X_n,Y_n))\) be independent, identically distributed (i.i.d.) examples, and let \(h\) be a hypothesis from a set of hypotheses \(\mathcal{H}\), which gets loss \(\ell(X_i,Y_i,h)\) on the \(i\)-th example. For example, we might think of the squared loss \(\ell(X_i,Y_i,h) = (Y_i - h(X_i))^2\). We also define the empirical error¹ of \(h\)
\[R_n(D,h) = \frac{1}{n} \sum_{i=1}^n \ell(X_i,Y_i,h),\]and our goal is to prove that the empirical error is close to the expected error
\[R(h) = \E[\ell(X,Y,h)]\]with high probability. To do this, we define the function
\[M_\eta(h) = -\tfrac{1}{\eta} \ln \E\Big[e^{-\eta \ell(X,Y,h)}\Big] \qquad \text{for $\eta > 0$,}\]which will act as a surrogate for \(R(h)\). Now the Cramér-Chernoff method tells us that:
Lemma 1. For any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation}\label{eqn:chernoff} M_\eta(h) \leq R_n(D,h) + \frac{1}{\eta n}\ln \frac{1}{\delta} \end{equation}with probability at least \(1-\delta\).
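To get a feel for Lemma 1, one can check it by simulation. The following sketch assumes Bernoulli(\(p\)) losses, for which \(M_\eta(h) = -\tfrac{1}{\eta}\ln\big((1-p) + p e^{-\eta}\big)\) is available in closed form; it returns the observed frequency with which the bound is violated, which should be at most \(\delta\).

```python
import math, random

def check_lemma1(p=0.3, n=100, eta=1.0, delta=0.05, runs=2000, seed=0):
    """Monte Carlo check of Lemma 1 for Bernoulli(p) losses:
    M_eta <= R_n + ln(1/delta)/(eta*n) should fail with probability <= delta."""
    rng = random.Random(seed)
    # exact M_eta(h) for a {0,1}-valued loss with mean p
    m_eta = -math.log((1 - p) + p * math.exp(-eta)) / eta
    slack = math.log(1 / delta) / (eta * n)
    violations = 0
    for _ in range(runs):
        r_n = sum(rng.random() < p for _ in range(n)) / n  # empirical error
        if m_eta > r_n + slack:
            violations += 1
    return violations / runs
```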
Many standard concentration inequalities can then be derived by first relating \(M_\eta(h)\) to \(R(h)\), and then optimizing \(\eta\). This includes, for example, Hoeffding’s inequality and inequalities involving the variance like Bernstein’s inequality. See the full version of this post for details.
Now suppose we use an estimator \(\hat{h} \equiv \hat{h}(D) \in \mathcal{H}\) to pick a hypothesis based on the data, for example using empirical risk minimization: \(\hat{h} = \argmin_{h \in \mathcal{H}} R_n(D,h)\). To get a bound for \(\hat{h}\) instead of a fixed \(h\), we want \eqref{eqn:chernoff} to hold for all \(h \in \mathcal{H}\) simultaneously. If \(\mathcal{H}\) is countable, this can be done using the union bound, which leads to:
Lemma 4. Suppose \(\mathcal{H}\) is countable. For \(h \in \mathcal{H}\), let \(\pi(h)\) be any numbers such that \(\pi(h) \geq 0\) and \(\sum_h \pi(h)= 1\). Then, for any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation} M_\eta(\hat{h}) \leq R_n(D,\hat{h}) + \frac{1}{\eta n}\ln \frac{1}{\pi(\hat{h})\delta} \end{equation} with probability at least \(1-\delta\).
In this context, the function \(\pi\) is often referred to as a prior distribution, even though it need not have anything to do with prior beliefs.
Just like for Lemma 1, we can then again relate \(M_\eta(h)\) to \(R(h)\) to obtain a bound on the expected error. This shows, in a nutshell, how one can combine the Cramér-Chernoff method with the union bound to obtain concentration inequalities for estimators \(\hat{h}\). The use of the union bound, however, is quite crude when there are multiple hypotheses in \(\mathcal{H}\) with very similar losses, and the current proof breaks down completely if we want to extend it to continuous classes \(\mathcal{H}\). This is where PAC-Bayesian bounds come to the rescue: in the next section I will explain the PAC-Bayesian generalisation of Lemma 4 to continuous hypothesis classes \(\mathcal{H}\), which will require replacing \(\hat{h}\) by a randomized estimator.
Let \(\hat{\pi} \equiv \hat{\pi}(D)\) be a distribution on \(\mathcal{H}\) that depends on the data \(D\), which we will interpret as a randomized estimator: instead of choosing \(\hat{h}\) deterministically, we will sample \(h \sim \hat{\pi}\) randomly. The distribution \(\hat{\pi}\) is often called the PAC-Bayesian posterior distribution. Now the result that the PAC-Bayesians have, may be expressed as follows:
Lemma 6. Let \(\pi\) be a (prior) distribution on \(\mathcal{H}\) that does not depend on \(D\), and let \(\hat{\pi}\) be a randomized estimator that is allowed to depend on \(D\). Then, for any \(\eta > 0\), \(\delta \in (0,1]\), \begin{equation}\label{eqn:pacbayes} \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_{h \sim \hat{\pi}}[R_n(D,h)] + \frac{1}{\eta n}\Big(D(\hat{\pi}\|\pi) + \ln \frac{1}{\delta}\Big) \end{equation} with probability at least \(1-\delta\). Moreover, \begin{equation}\label{eqn:pacbayesexp} \E_D \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_D\Big[ \E_{h \sim \hat{\pi}}[R_n(D,h)] + \frac{1}{\eta n}D(\hat{\pi}\|\pi)\Big]. \end{equation}
Here \(D(\hat{\pi}\|\pi) = \int \hat{\pi}(h) \ln \frac{\hat{\pi}(h)}{\pi(h)} \mathrm{d} h\) denotes the Kullback-Leibler divergence of \(\hat{\pi}\) from \(\pi\).
To see that Lemma 6 generalises Lemma 4, suppose that \(\hat{\pi}\) is a point-mass on \(\hat{h}\). Then \(D(\hat{\pi}\|\pi) = \ln (1/\pi(\hat{h}))\), and we recover Lemma 4 as a special case of \eqref{eqn:pacbayes}. An important difference with Lemma 4, however, is that Lemma 6 does not require \(\mathcal{H}\) to be countable, and in fact in many PAC-Bayesian applications it is not.
We have seen how PAC-Bayesian inequalities naturally extend standard concentration inequalities based on the Cramér-Chernoff method by generalising the union bound to a continuous version. I’m afraid that’s all that will reasonably fit into a single blog post on my website. If you want more, simply continue with the full version, which covers several more issues, including:
Called the empirical risk in statistics; hence the notation with ‘R’. ↩
In the 1990s, people familiar with the work of Vovk called these special loss functions mixable losses, but nowadays the notion of mixability appears to be mostly forgotten, and the geometric concept of exp-concavity has taken its place. This raises the question of how the two are related, which strangely does not appear to be answered in very much detail in the literature. As I have been studying mixability quite a bit in my recent work, I was wondering about this, so here are some thoughts. Update: In particular, I will construct a parameterization of the squared loss in which it is \(1/2\)-exp-concave instead of only \(1/8\)-exp-concave like in its usual parameterization.
This post is also available as FromExpConcavityToMixability.pdf.
Suppose we predict an outcome \(y \in \mathcal{Y}\) by specifying a prediction \(a \in \mathcal{A}\). The better our prediction, the smaller our loss \(\ell(y,a)\). (I will assume \(\ell(y,a)\) is nonnegative, but that does not really matter.) For example, if \(y\) and \(a\) both take values in \(\{0,1\}\), then the \(0/1\)-loss \(\ell(y,a) = |y-a|\) is \(0\) if we predict correctly and \(1\) otherwise. Alternatively, if \(y\) and \(a\) are both real-valued, then the squared loss is \(\ell(y,a) = (y-a)^2\). And finally, if \(a\) specifies a probability density \(f_a\) on \(\mathcal{Y}\), then our loss may be the log loss \(\ell(y,a) = -\ln f_a(y)\).
For \(\eta > 0\), a loss function is called \(\eta\)-mixable [1] if for any probability distribution \(\pi\) on \(\mathcal{A}\) there exists a prediction \(a_\pi \in \mathcal{A}\) such that
\[\begin{equation}\label{eqn:mixable} e^{-\eta \ell(y,a_\pi)} \geq \int e^{-\eta \ell(y,a)} \pi(\mathrm{d} a) \qquad \text{for all $y \in \mathcal{Y}$.} \end{equation}\]For \(\eta\)-mixable losses, a learner combining expert predictions can guarantee an overhead of only \(O(1)\) compared to the best expert; the constant in the \(O(1)\) overhead is proportional to \(1/\eta\), so the bigger \(\eta\) the better.
For \(\eta > 0\), a loss function is called \(\eta\)-exp-concave if for any distribution \(\pi\) on \(\mathcal{A}\) the prediction \(a_\pi = \int a \,\pi(\mathrm{d} a)\) satisfies \eqref{eqn:mixable}. So exp-concavity is just mixability with \(a_\pi\) fixed to be the mean.
This choice is appropriate in the case of log loss: for \(\eta = 1\), the numbers \(e^{-\eta \ell(y,a)} = f_a(y)\) are just probability densities and \eqref{eqn:mixable} holds with equality. For squared loss, however, the appropriate choice for \(a_\pi\) is not the mean. Suppose that \(y\) and \(a\) both take values in \([-1,+1]\). Then, while the squared loss is \(1/2\)-mixable for \(a_\pi = \frac{h_{1/2}(-1) - h_{1/2}(1)}{4}\) with \(h_\eta(y) = \frac{-1}{\eta} \ln \int e^{-\eta (y-a)^2} \pi(\mathrm{d} a)\), it is only \(1/8\)-exp-concave when parameterized by \(a\) [2], [3]. This does not rule out, however, that the squared loss might be \(1/2\)-exp-concave in a different parameterization. As we shall see, such a parameterization indeed exists if we restrict \(y\) to take only two values \(\{-1,+1\}\), but I have not been able to find a suitable reparameterization in general.
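The substitution function \(a_\pi = \frac{h_{1/2}(-1) - h_{1/2}(1)}{4}\) for the squared loss can be checked numerically. The following sketch evaluates the mixability inequality \eqref{eqn:mixable} on a grid of outcomes \(y \in [-1,+1]\) for finitely supported \(\pi\); the grid and example distributions are my own choices.

```python
import math

def h(y, support, probs, eta=0.5):
    """h_eta(y) = -(1/eta) * ln sum_i pi_i * exp(-eta*(y - a_i)^2)."""
    s = sum(p * math.exp(-eta * (y - a) ** 2) for a, p in zip(support, probs))
    return -math.log(s) / eta

def mixable_prediction(support, probs):
    """Substitution function from the text: a_pi = (h_{1/2}(-1) - h_{1/2}(1)) / 4."""
    return (h(-1.0, support, probs) - h(1.0, support, probs)) / 4.0

def mixability_gap(support, probs, eta=0.5):
    """Smallest slack in exp(-eta*l(y,a_pi)) >= int exp(-eta*l(y,a)) pi(da)
    over a grid of y in [-1, 1]; nonnegative iff the inequality holds there."""
    a_pi = mixable_prediction(support, probs)
    gap = float("inf")
    for i in range(201):
        y = -1.0 + i / 100.0
        lhs = math.exp(-eta * (y - a_pi) ** 2)
        rhs = sum(p * math.exp(-eta * (y - a) ** 2)
                  for a, p in zip(support, probs))
        gap = min(gap, lhs - rhs)
    return gap
```

For a point-mass \(\pi\) on \(a_0\) the substitution function returns \(a_0\) itself and the inequality holds with equality, which is a quick sanity check.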
Clearly, exp-concavity implies mixability: it just makes the choice for \(a_\pi\) explicit. What is not so obvious, is when the implication also goes the other way. It turns out that in some cases it actually does if we reparameterize our predictions in a clever (one might also say: complicated) way by the elements of a certain set \(\mathcal{B}_\eta\).
Theorem 1. Suppose a loss \(\ell \colon \mathcal{Y} \times \mathcal{A} \to [0,\infty]\) satisfies Conditions 1 and 2 below for some \(\eta > 0\). Then \(\ell\) is \(\eta\)-mixable if and only if it can be parameterized in such a way that it is \(\eta\)-exp-concave.
The technical conditions I need are the following:
It remains to explain what these conditions mean, and discuss their severity. I will argue that Condition 1 is very mild. Condition 2 also appears to be generally satisfied if the dimensionality of the set of predictions equals the number of possible predictions minus one, i.e. \(\dim(\mathcal{A}) = |\mathcal{Y}|-1\), but not in general. For example, for the squared loss we predict by a single number \(a\), so \(\dim(\mathcal{A}) = 1\) and hence we have \(\dim(\mathcal{A}) = |\mathcal{Y}|-1\) if \(y\) only takes two different values, but not if \(\mathcal{Y}\) is the whole range \([-1,+1]\). Update: We can work around this, though. See below.
Condition 1 is the easiest of the two. I will call a prediction \(a \in \mathcal{A}\) admissible if there exists no other prediction \(b \in \mathcal{A}\) that is always at least as good in the sense that \(\ell(y,b) \leq \ell(y,a)\) for all \(y \in \mathcal{Y}\). If \(a\) is inadmissible, then we could just remove it from the set of available predictions \(\mathcal{A}\), because predicting \(b\) is always at least as good anyway. So admissibility seems more of an administrative requirement (get rid of all predictions that make no sense) than a real restriction.
To explain the second condition, we define the new parameterization \(\mathcal{B}_\eta\) as the set of functions
\[\begin{equation*} \mathcal{B}_\eta = \{g \colon \mathcal{Y} \to [0,1] \mid \text{for some distribution $\pi$: } g(y) = \int e^{-\eta \ell(y,a)} \pi(\mathrm{d} a)\ \forall y\}. \end{equation*}\]Note that the set \(\mathcal{B}_\eta\) is convex by construction.
Let \(\mathbb{1}(y) = 1\) be the constant function that is \(1\) on all \(y \in \mathcal{Y}\), and for any \(g \in \mathcal{B}_\eta\) let \(c(g) = \sup \{c \geq 0 \mid (g + c \cdot \mathbb{1}) \in \mathcal{B}_\eta\}\). By the north-east boundary of \(\mathcal{B}_\eta\), I mean the set of points \(\{g + c(g) \cdot \mathbb{1} \mid g \in \mathcal{B}_\eta \}\). That is, if we move `south-west' from any point in this set (in the direction of \(-\mathbb{1}\)), we are inside \(\mathcal{B}_\eta\), but if we move further `north-east' (in the direction of \(\mathbb{1}\)) we are outside.
Condition 2 requires that the north-east boundary of \(\mathcal{B}_\eta\) be equal to the set \(\{e^{-\eta \ell(\cdot,a)} \mid a \in \mathcal{A}\}\), which appears to be quite typical if \(\dim(\mathcal{A}) = |\mathcal{Y}| - 1\), but not in general.
As we have already seen that \(\eta\)-exp-concavity trivially implies \(\eta\)-mixability, it remains to construct the parameterization in which \(\ell\) is \(\eta\)-exp-concave given that it is \(\eta\)-mixable.
The parameterization we choose is indexed by the elements of \(\mathcal{B}_\eta\), which we map onto \(\mathcal{A}\), with multiple elements in \(\mathcal{B}_\eta\) mapping to the same element of \(\mathcal{A}\). So let \(g\) be an arbitrary element of \(\mathcal{B}_\eta\). How do we map it to a prediction \(a \in \mathcal{A}\)? We do this by choosing the prediction \(a\) such that \(g(y) + c(g) = e^{-\eta \ell(y,a)}\) for all \(y\). As \(g + c(g)\cdot \mathbb{1}\) lies on the north-east boundary of \(\mathcal{B}_\eta\), such a prediction exists by Condition 2.
Our construction ensures that there exists a \(g \in \mathcal{B}_\eta\) that maps to \(a\) for any \(a \in \mathcal{A}\). To see this, suppose there were an \(a\) for which this was not the case, and let \(g_a = e^{-\eta \ell(\cdot,a)}\). Then we must have \(c(g_a) > 0\), because otherwise we would have \(c(g_a) = 0\) and \(g_a\) would map to \(a\). But then the prediction \(b \in \mathcal{A}\) such that \(e^{-\eta \ell(\cdot,b)} = g_a + c(g_a) \cdot \mathbb{1}\) would satisfy \(e^{-\eta \ell(y,b)} > e^{-\eta \ell(y,a)}\) for all \(y\), and hence \(\ell(y,b) < \ell(y,a)\) for all \(y\), so that \(a\) would be inadmissible, which we have ruled out by assumption.
We are now ready to prove that the loss is \(\eta\)-exp-concave in our parameterization. To show this, let \(\pi\) be an arbitrary probability distribution on \(\mathcal{B}_\eta\). Then we need to show that
\[\begin{equation*} e^{-\eta \ell(y,g_\pi)} \geq \int e^{-\eta \ell(y,g)} \pi(\mathrm{d} g) \qquad \text{for all $y \in \mathcal{Y}$,} \end{equation*}\]where \(g_\pi = \int g\ \pi(\mathrm{d} g)\). To this end, observe that
\[\begin{equation*} \int e^{-\eta \ell(\cdot,g)} \pi(\mathrm{d} g) = \int (g + c(g)\cdot \mathbb{1})\ \pi(\mathrm{d} g) = g_\pi + c_\pi\cdot \mathbb{1}, \end{equation*}\]where \(c_\pi = \int c(g)\ \pi(\mathrm{d} g)\). Now convexity of \(\mathcal{B}_\eta\) ensures that \(\int e^{-\eta \ell(\cdot,g)} \pi(\mathrm{d} g) \in \mathcal{B}_\eta\), so that we must have \(c_\pi \leq c(g_\pi)\). But then
\[\begin{equation*} e^{-\eta \ell(y,g_\pi)} = g_\pi(y) + c(g_\pi) \geq g_\pi(y) + c_\pi = \int e^{-\eta \ell(y,g)} \pi(\mathrm{d} g) \end{equation*}\]for all \(y\), which was to be shown.
So how do things play out for the squared loss? We know that it is \(1/2\)-mixable, so we would like to find a parameterization in which it is also \(1/2\)-exp-concave. Suppose first that \(a\) takes values in \([-1,+1]\) and \(y\) takes only two values \(\{-1,+1\}\). Then Condition 1 is clearly satisfied. The set \(\mathcal{B}_{1/2}\) consists of all the functions \(g \colon \{-1,+1\} \to [e^{-2},1]\) such that
\[\begin{equation}\label{eqn:newparam} g(y) = \int e^{-\frac{1}{2} (y-a)^2} \pi(\mathrm{d} a) \qquad \text{for $y \in \{-1,+1\}$} \end{equation}\]for some distribution \(\pi\) on \(\mathcal{A}\). So to verify Condition 2, we need to check that for any \(g \in \mathcal{B}_{1/2}\) there exists a prediction \(a_g \in \mathcal{A}\) that satisfies
\[\begin{equation}\label{eqn:squaredlosstosolve} g(y) + c(g) = e^{-\frac{1}{2}(y-a_g)^2} \qquad \text{for $y \in \{-1,+1\}$.} \end{equation}\]Solving this we find that \(a_g\) indeed exists and equals
\[\begin{equation}\label{eqn:backtoorig} a_g = f^{-1}\big(g(1)-g(-1)\big), \end{equation}\]where \(f^{-1}\) is the inverse of \(f(\alpha) = e^{-\frac{1}{2}(1-\alpha)^2} - e^{-\frac{1}{2}(\alpha+1)^2}\) (see the figure below). The existence of \(a_g\) for all \(g\) implies that Condition 2 is satisfied, and by Theorem 1 we have found a parameterization in which the squared loss is \(1/2\)-exp-concave, provided that \(y\) only takes the values \(\{-1,+1\}\).
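The map \eqref{eqn:backtoorig} is easy to evaluate numerically, since \(f\) is increasing on \([-1,+1]\) and can therefore be inverted by bisection. Here is a small sketch; the finitely supported example distributions are my own.

```python
import math

def f(alpha):
    """f(alpha) = exp(-(1-alpha)^2/2) - exp(-(alpha+1)^2/2), from the text."""
    return math.exp(-0.5 * (1 - alpha) ** 2) - math.exp(-0.5 * (alpha + 1) ** 2)

def f_inverse(d, lo=-1.0, hi=1.0, tol=1e-12):
    """Invert f on [-1, 1] by bisection (f is increasing there)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < d:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def a_g(support, probs):
    """Map g in B_{1/2} back to a prediction via a_g = f^{-1}(g(1) - g(-1))."""
    g = lambda y: sum(p * math.exp(-0.5 * (y - a) ** 2)
                      for a, p in zip(support, probs))
    return f_inverse(g(1.0) - g(-1.0))
```

For a point-mass \(\pi\) on \(a_0\) we have \(g(1) - g(-1) = f(a_0)\), so the map recovers \(a_0\) exactly; for non-degenerate \(\pi\) the common offset \(c(g)\) in \eqref{eqn:squaredlosstosolve} comes out the same at both outcomes, as it should.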
So what happens if we allow \(y\) to vary over the whole range \([-1,+1]\)? In this case I believe that no choice of \(a_g\) will satisfy \eqref{eqn:squaredlosstosolve} for all \(y\), and consequently Condition 2 does not hold. Update: However, it turns out that any parametrization that is \(\eta\)-exp-concave for \(y \in \{-1,+1\}\) is also \(\eta\)-exp-concave for the whole range \(y \in [-1,+1]\). This is a special property, proved by Haussler, Kivinen and Warmuth [2, Lemma 4.1], [3, Lemma 3], that only holds for certain loss functions, including the squared loss. Thus we have found a parameterization of the squared loss with \(y \in [-1,+1]\) in which it is \(1/2\)-exp-concave (instead of only \(1/8\)-exp-concave like in the standard parameterization): parameterize by the functions \(g\) defined in \eqref{eqn:newparam}, and map them to original parameters via the mapping \(a_g\) defined in \eqref{eqn:backtoorig}.
We have seen that exp-concavity trivially implies mixability. Conversely, mixability also implies exp-concavity roughly when the dimensionality of the set of predictions \(\dim(\mathcal{A})\) equals the number of outcomes \(|\mathcal{Y}|\) minus one. In general, however, it remains unknown whether any \(\eta\)-mixable loss can be reparameterized to make it \(\eta\)-exp-concave with the same \(\eta\).
As exp-concavity is a stronger requirement than mixability and introduces these complicated reparameterization problems, one might ask: why bother with it at all? One answer to this is that taking \(a_\pi\) to be the mean reduces the requirement \eqref{eqn:mixable} to ordinary concavity, which has a nice geometrical interpretation. Nevertheless, the extra flexibility offered by mixability can make it easier to satisfy (for example, for the squared loss), so in general mixability would appear to be the more convenient of the two properties.
It seems the proof of Theorem 1 would still work if we replaced \(\mathbb{1}\) by any other positive function. I wonder whether this extra flexibility might make Condition 2 easier to satisfy.
Update: I would like to thank Sébastien Gerchinovitz for pointing me to Lemma 4.1 of Haussler, Kivinen and Warmuth [2].
It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…
\(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min}\)Let \(X, X_1, \ldots, X_n\) be independent, identically distributed random variables, and let \(\mu(\lambda) = \log \E[e^{\lambda X}]\) be their cumulant generating function. Then many standard concentration inequalities (Hoeffding, Bernstein, Bennett) can be derived [1, Appendix A.1] from a single basic result:
\[\Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) \leq e^{-n\{\lambda a - \mu(\lambda)\}}\qquad(\lambda > 0).\]This result can easily be proved using Markov’s inequality and is non-trivial as long as \(a > \E[X]\). Its ubiquity is explained by the Cramér-Chernoff theorem [2], which states that it asymptotically achieves the best possible exponent if we optimize our choice of \(\lambda\):
\[\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) = \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\]The empirical distribution \(P_n\) of \(X_1, \ldots, X_n\) gives probability \(P_n(A) = \frac{1}{n}\sum_{i=1}^n {\bf 1}_{\{X_i \in A\}}\) to any event \(A\), which is the fraction of variables taking their value in \(A\). If the distribution of the variables is \(P^*\), then asymptotically we would expect \(P_n\) to be close to \(P^*\). So then if \(\mathcal{P}\) is a set of distributions that is far away from \(P^*\) in some sense, the probability that \(P_n \in \mathcal{P}\) should be small. This intuition is quantified by Sanov’s theorem [3], [4].
To avoid measure-theoretic complications, let us assume that the variables \(X_1, \ldots, X_n\) are discrete, with a finite number of possible values. Suppose \(\mathcal{P}\) is a convex set of distributions and assume that the information projection
\[Q^* = \argmin_{P \in \mathcal{P}}\ D(P\|P^*)\]of \(P^*\) on \(\mathcal{P}\) exists, where \(D(P\|P^*) = \sum_x P(x) \log \frac{P(x)}{P^*(x)}\) is the Kullback-Leibler divergence. Then \(D(Q^*\|P^*)\) provides an information-theoretic measure of the “distance” of \(\mathcal{P}\) from \(P^*\), and indeed [3]
\[\Pr\Big(P_n \in \mathcal{P}\Big) \leq e^{-nD(Q^*\|P^*)}.\]Moreover, Sanov’s theorem tells us that this bound is asymptotically tight in the exponent:
\[\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(P_n \in \mathcal{P}\Big) = D(Q^*\|P^*).\]Csiszár [3, p. 790] has an elegant information-theoretic proof of the upper bound. He also works out sufficient measure-theoretic conditions for the theorem to hold in continuous spaces, which are quite clean, but still require considerable care to verify.
One may extend Sanov’s theorem to non-convex sets \(\mathcal{P}\) by a union bound argument. For example, Cover and Thomas [4] take a union bound over all possible values for \(P_n\), which they call types. By discretization arguments one may further extend the theorem to infinite spaces [5], but then things get a bit too asymptotic for my taste.
It turns out that the Cramér-Chernoff theorem can be obtained from Sanov’s theorem. (This is called contraction.) The trick is to observe that
\[\E_{X \sim P_n}[X] = \frac{1}{n} \sum_{i=1}^n X_i,\]so if we define the convex set \(\mathcal{P} = \{P \mid \E_{X \sim P}[X] \geq a\}\), then
\[\Pr\Big(P_n \in \mathcal{P}\Big) = \Pr\Big(\frac{1}{n} \sum_{i=1}^n X_i \geq a\Big).\]It remains to evaluate \(D(Q^* \| P^*)\) in this case, which can be done as follows:
Introduce a Lagrange multiplier to obtain:
\[\min_{P \in \mathcal{P}} D(P\|P^*) = \min_{\{P \mid \E_{X \sim P}[X] \geq a\}} D(P\|P^*) = \min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]Use Sion’s minimax theorem to swap the order of the min and the sup:
\[\min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} = \sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]Recognize the following convex duality for Kullback-Leibler divergence:
\[\sup_P\ \big(\E_{X \sim P}[\lambda X] - D(P\|P^*)\big) = \mu(\lambda),\]where \(\mu(\lambda) = \log \E_{P^*}[e^{\lambda X}]\) is the cumulant generating function defined above. We get:
\[\begin{split}\sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} &= \sup_{\lambda > 0}\Big\{\lambda a -\sup_P\Big(\E_P[\lambda X]- D(P\|P^*)\Big)\Big\}\\&= \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\end{split}\]Chaining everything together we exactly recover the Cramér-Chernoff theorem, and we see that the upper bounds have exactly the same constants.
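For Bernoulli variables both ends of this chain of equalities are available in closed form, which makes for a quick sanity check; the Bernoulli instantiation below is my own example, not from the text.

```python
import math

def kl_bernoulli(a, p):
    """D(Ber(a) || Ber(p)): the information projection of Ber(p) on
    {P : E_P[X] >= a} is Ber(a) when a > p."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def chernoff_exponent(a, p):
    """sup_{lambda>0} {lambda*a - mu(lambda)} with mu(lambda) = ln(1-p+p*e^lambda),
    evaluated at the closed-form optimizer lambda* = ln(a(1-p)/(p(1-a))).
    Valid for a > p, so that lambda* > 0."""
    lam = math.log(a * (1 - p) / (p * (1 - a)))
    mu = math.log(1 - p + p * math.exp(lam))
    return lam * a - mu
```

The two functions agree, reflecting that for the set \(\mathcal{P} = \{P \mid \E_{X \sim P}[X] \geq a\}\) the Sanov exponent \(D(Q^*\|P^*)\) equals the Cramér-Chernoff exponent.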
One may also view things the other way around. The Cramér-Chernoff theorem bounds the probability that the value of the empirical mean \(\frac{1}{n} \sum_{i=1}^n X_i\) lies in the set \(A = [a,\infty)\). As discussed by [2, pp. 208-210], both the notion of empirical mean and the set \(A\) can be generalized. In particular, one may regard the empirical distribution \(P_n\) as the mean of \(n\) point-masses (i.e. Dirac measures) on the points \(X_1, \ldots, X_n\). Van der Vaart then presents Sanov’s theorem as just one instance of such generalized Cramér-Chernoff theorems.
We have seen the close similarities between the Cramér-Chernoff theorem and Sanov’s theorem. For me Sanov’s theorem seems easier to interpret, but in continuous spaces one has to deal with the more complicated measure-theoretic conditions of Csiszár [3]. For technical reasons it may therefore be preferable to use the Cramér-Chernoff result.
It turns out that the upper bound in the Cramér-Chernoff theorem does leave some slack of order \(1/\sqrt{n}\), which is negligible compared to the term in the exponent.