Jekyll2023-02-05T13:09:36+01:00https://www.timvanerven.nl/feed.xmlTim van ErvenTim van Erven's websiteTim van ErvenWhy FTRL Is Better than Online Mirror Descent2021-09-21T00:00:00+02:002021-09-21T00:00:00+02:00https://www.timvanerven.nl/blog/ftrl-vs-omdMany online learning algorithms come in two flavors: you can either
choose the **online mirror descent** (OMD) version or the
**follow-the-regularized-leader** (FTRL) = dual-averaging version. In
practice people usually go for OMD, which makes sense from a theoretical
point of view because it can easily adapt to the size of the gradients,
i.e. it is Lipschitz-adaptive. This comes with a big downside, however,
because OMD may perform (very!) poorly when your intermediate iterates
happen to wander away from the optimal parameters for a while, as may
easily happen with non-stationary data. FTRL, on the other hand, is much
more robust to non-stationarity, but it is not Lipschitz-adaptive by
default according to its standard analysis. What should we do? It turns
out there are actually two easy fixes to get the best of both worlds,
which both build off of FTRL: the first fix is to use a **refinement of
the standard analysis** due to Orabona and Pál to show that FTRL can be
made Lipschitz-adaptive after all; and the second fix is to combine
FTRL with a technique that we have started calling **Cutkosky clipping**
in my group, after its inventor Ashok Cutkosky. Read on for an overview
of the two approaches, which are both worth having in your theoretical
toolbox.
<!--more-->
## Setting: Online Convex Optimization
We consider the following online learning setting, which is called
$$
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator{\argmin}{arg\,min}
\DeclareMathOperator{\diameter}{diameter}
\DeclareMathOperator{\proj}{Proj}
\newcommand{\u}{\boldsymbol u}
\newcommand{\v}{\boldsymbol v}
\newcommand{\w}{\boldsymbol w}
\newcommand{\grad}{\boldsymbol g}
\newcommand{\clipgrad}{\bar \grad}
\newcommand{\tf}{\tilde f}
\newcommand{\tR}{\tilde R}
\newcommand{\domain}{\mathcal{W}}
\newcommand{\reals}{\mathbb{R}}
\newcommand{\Dmax}{D_\text{max}}
\newcommand{\Gmax}{G_\text{max}}
\newcommand{\Done}{D_1}
$$
Online Convex Optimization (OCO): Each round $$t=1,\ldots,T$$, a learner
predicts a point $$\w_t$$ from a convex domain $$\domain \subset
\reals^d$$ of diameter at most $$\diameter(\domain) \leq D$$. The
environment then selects a convex loss function $$f_t : \domain \to
\reals$$ and the learner incurs loss $$f_t(\w_t)$$. Most online learning
algorithms update their predictions efficiently, using only the gradient
$$\grad_t = \nabla f_t(\w_t)$$, and the objective is for the learner to
control the regret
$$
R_T(\u) = \sum_{t=1}^T \big(f_t(\w_t) - f_t(\u)\big)
$$
for all $$\u \in \domain$$ simultaneously.
## Two Flavors of Online Gradient Descent
For simplicity, I will focus on online gradient descent (OGD), but the
same discussion equally applies to other instances of OMD and
FTRL. I am also considering here the version of FTRL with linearized
losses, which makes it line up well with OMD. Given a starting point
$$\w_1 = 0$$ and a sequence of stepsizes $$\eta_1,\eta_2,\ldots$$, the
two versions of OGD are defined as follows:
**OMD Version**:
$$
\hspace{3em}
\begin{align*}
\w_{t+1} &= \argmin_{w \in \domain} \w^\top \grad_t + \frac{1}{2\eta_t}
\|\w - \w_t\|^2\\
&= \proj(\w_t - \eta_t \grad_t)
\end{align*}
$$
**FTRL Version**:
$$
\hspace{5em}
\begin{align*}
\w_{t+1} &= \argmin_{w \in \domain} \sum_{s=1}^t \w^\top \grad_s + \frac{1}{2\eta_t}
\|\w - \w_1\|^2\\
&= \proj(\w_1 - \eta_t \sum_{s=1}^t \grad_s),
\end{align*}
$$
where $$\proj(\tilde \w) = \argmin_{\w \in \domain} \|\w - \tilde \w\|$$
is the Euclidean projection of $$\tilde \w$$ onto the allowed domain.
(All norms in this post are standard Euclidean norms.)
The difference between the two versions is that OMD updates from the
previous iterate $$\w_t$$ using only the latest gradient $$\grad_t$$,
whereas FTRL updates from the starting point $$\w_1$$ using all the
gradients. Computationally, both versions are equally efficient, because
you end up storing a sum of gradients in both cases.
In fact, if the step size $$\eta_t = \eta$$ is constant and there are no
projections, then the two methods even coincide! This is highly
suboptimal in practice, however, where we always want to use decreasing
step sizes. To illustrate this, suppose we would set $$\eta
\leq \text{constant}/\sqrt{T}$$, then the algorithm would essentially be
stuck at its starting point $$\w_1$$ for roughly the first $$\sqrt{T}$$
rounds. But if we set $$\eta \gg 1/\sqrt{T}$$, then the iterates can
become too unstable for large $$t$$, and suffer large regret as a
result.
## Regret Bounds
On to regret bounds! The regret of OGD has been extensively analyzed in
the literature. For instance, from the proof of Theorem 1 in
[Zinkevich's well-known
paper](https://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf) it can be
seen that the OMD version satisfies the following guarantee:
**Theorem 1 (OMD Version):** *For any non-increasing step
sizes $$\eta_1\geq \eta_2 \geq \ldots$$, the OMD version of OGD
guarantees regret at most*
$$
\begin{align}\label{eqn:mdbound}
R_T(\u)
&\leq \frac{\|\w_1 - \u\|^2}{2\eta_1} + \Big(\frac{1}{2\eta_T} -
\frac{1}{2\eta_1}\Big) \max_{t \in \{2,\ldots,T\}}\|\w_t - \u\|^2 + \frac{1}{2} \sum_{t=1}^T \eta_t \|\grad_t\|^2\\
&\leq \frac{1}{2\eta_T} \max_{t \in \{1,\ldots,T\}}\|\w_t - \u\|^2 + \frac{1}{2} \sum_{t=1}^T \eta_t \|\grad_t\|^2.\label{eqn:mdbound2}
\end{align}
$$
*Consequently, if we set $$\eta_t = \frac{\Dmax}{\sqrt{\sum_{s=1}^t
\|\grad_s\|^2}}$$ for $$\Dmax \geq \max_{t \in \{1,\ldots,T\}} \|\w_t -
\u\|$$, then*
$$
R_T(\u) \leq \frac{3}{2} \Dmax \sqrt{\sum_{t=1}^T \|\grad_t\|^2}
\leq \frac{3}{2} \Dmax \Gmax \sqrt{T},
$$
*where $$\Gmax = \max_{t \in \{1,\ldots,T\}} \|\grad_t\|$$.*
A good reference to look up the standard analysis of FTRL is Corollary
7.9 in Orabona's very nice [introduction to online
learning](https://arxiv.org/abs/1912.13213), which specializes as
follows to OGD:
**Theorem 2 (FTRL Version):** *For any non-increasing step sizes
$$\eta_1\geq \eta_2 \geq \ldots$$, the FTRL version of OGD guarantees
regret at most*
$$
\begin{equation}\label{eqn:ftrlbound}
R_T(\u) \leq \frac{\|\w_1 - \u\|^2}{2\eta_{T-1}} + \frac{1}{2} \sum_{t=1}^T \eta_{t-1} \|\grad_t\|^2.
\end{equation}
$$
*Consequently, if we set $$\eta_t = \frac{\Done}{\sqrt{\Gmax^2 +
\sum_{s=1}^t \|\grad_s\|^2}}$$ for $$\Done \geq \|\w_1 - \u\|$$ with
$$\Gmax$$ as in Theorem 1, then*
$$
R_T(\u) \leq \frac{3}{2} \Done \sqrt{\Gmax^2 + \sum_{t=1}^{T-1} \|\grad_t\|^2}
\leq \frac{3}{2} \Done \Gmax \sqrt{T}.
$$
### Two Big Differences
Comparing $$\eqref{eqn:mdbound}$$ and $$\eqref{eqn:ftrlbound}$$, we see
that the bounds coincide when $$\eta_t = \eta$$ is a constant, but they
start to differ significantly in the common case that the step sizes are
decreasing. The two main differences between Theorems 1 and 2 are as
follows:
1. Advantage of FTRL over OMD: $$\Done \ll \Dmax$$
Comparing the two theorems, we see that the guarantee for OMD depends on
$$\Dmax$$ whereas the guarantee for FTRL only depends on $$\Done \leq
\Dmax$$. This matters, because, after tuning the step sizes, $$\Done$$
and $$\Dmax$$ end up as multiplicative factors in front of the regret
bounds, so they have a big impact on the regret. To see that the
difference can be significant, note that, if the domain $$\domain$$ is
large compared to $$\Done$$ (or even unbounded) and the data are
non-stationary, then the intermediate iterates can move all over the
place and consequently we will have that $$\Dmax \gg \Done$$, implying
that OMD is much worse than FTRL. Moreover, we can directly influence
$$\Done$$ by changing the starting position $$\w_1$$ of the algorithm,
but $$\Dmax$$ is not easy to control, because it depends on all the
iterates of the algorithm, which are heavily dependent on the data. Now
we might still hope that the difference between $$\Done$$ and $$\Dmax$$
is just an artifact of the bounds and that things won't be as bad in
practice, but this hope turns out to be vain, because [Orabona and
Pál](https://www.sciencedirect.com/science/article/abs/pii/S0304397517308514)
(Theorem 3) construct a counterexample where the difference is real.
{:start="2"}
2. Advantage of OMD over FTRL: Lipschitz-adaptivity
There is another difference between Theorems 1 and 2, which is so subtle
that it is easily overlooked when reading the literature: compared to
$$\eqref{eqn:mdbound}$$ the step sizes in the sum $$\sum_{t=1}^T
\eta_{t-1} \|\grad_t\|^2$$ in $$\eqref{eqn:ftrlbound}$$ are one time
step behind.[^timediff] At first sight, this may seem like a minor difference, but
it is in fact crucial if we do not know the sizes of our gradients in
advance (and in practice we never do!). To see this, note that for OMD
the sum can be controlled by setting
[^timediff]: The difference between $$\eta_t$$ and $$\eta_{t-1}$$ is extra difficult to spot in the literature, because different notational conventions sometimes shift the subscript $$t$$ on $$\eta_t$$ by one.
$$
\eta_t \propto \frac{1}{\sqrt{\sum_{s=1}^t \|\grad_s\|^2}}.
$$
But getting the same control in FTRL would require
$$
\eta_t \propto \frac{1}{\sqrt{\sum_{s=1}^{t+1} \|\grad_s\|^2}},
$$
which is impossible because $$\|\grad_{t+1}\|$$ is unknown at the time
of choosing $$\eta_t$$. The solution shown in Theorem 2 is to fall back
on
$$
\eta_t \propto \frac{1}{\sqrt{\Gmax^2 + \sum_{s=1}^{t} \|\grad_s\|^2}},
$$
which replaces $$\|\grad_{t+1}\|$$ by the upper bound $$\Gmax$$. But
this is not really a solution at all, because it leaves us stuck with
$$\Gmax$$ as a hyper-parameter that we need to set. Now a seasoned
machine learning practitioner may not easily be intimidated by having to
tune yet another hyper-parameter, but $$\Gmax$$ is particularly nasty.
Indeed, $$\Gmax$$ is a moving target, because changing its estimate
changes the iterates $$\w_t$$ of the algorithm, which in turn changes
the gradients $$\grad_t$$ that it encounters, which changes $$\Gmax$$!
## Solutions
Are we then to live with the choice between a nasty dependence on
$$\Dmax$$ (for OMD) or an awful hyper-parameter $$\Gmax$$ (for FTRL)?
Luckily not, and there are even two possible ways out! In line with the
title of this post, both of these build on FTRL, by resolving the
hyper-parameter tuning issue.
### Solution 1: Improve the Analysis of FTRL
The first solution, which is perhaps the most comforting, is due to
[Orabona and
Pál](https://www.sciencedirect.com/science/article/abs/pii/S0304397517308514).
It turns out that there is really nothing wrong with FTRL per se, but
the problem arises only in the analysis. As shown in the proof of
Orabona and Pál's Theorem 1, it is possible to refine
$$\eqref{eqn:ftrlbound}$$ as follows:
$$
\begin{align}
R_T(\u)
&\leq \frac{\|\w_1 - \u\|^2}{2\eta_T}
+ \frac{1}{2} \sum_{t=1}^T
\min\Big\{\eta_{t-1} \|\grad_t\|^2, \eta_t \|\grad_t\|^2 +
2\Dmax^* \|\grad_t\|\Big\},\notag\\
&\leq \frac{\|\w_1 - \u\|^2}{2\eta_T}
+ \frac{1}{2} \sum_{t=1}^T
\eta_t \|\grad_t\|^2
+ \frac{1}{2} \sum_{t=1}^T
\min\Big\{\eta_{t-1} \|\grad_t\|^2, 2\Dmax^* \|\grad_t\|\Big\},
\label{eqn:opbound}
\end{align}
$$
where $$\Dmax^* = \max_{\v,\w \in \domain} \|\v - \w\|$$. The result of
the new minimum in the bound is that the effect of not including $$\|\grad_{t+1}\|$$
in $$\eta_t$$ is mitigated, and that we can get away with step sizes
$$
\eta_t = \frac{\Done}{\sqrt{\sum_{s=1}^t \|\grad_s\|^2}}.
$$
Plugging this into $$\eqref{eqn:opbound}$$, a standard summation
analogous to the one used in Theorem 1 leads to
$$
R_T(\u) \leq \frac{3}{2} \Done \sqrt{\sum_{t=1}^T \|\grad_t\|^2}
+ \frac{1}{2} \sum_{t=1}^T
\min\Big\{\eta_{t-1} \|\grad_t\|^2, 2\Dmax^* \|\grad_t\|\Big\}.
$$
To handle the remaining sum, Orabona and Pál provide the following
useful inequality (their Lemma 3):
**Lemma 3:** *Let $$C, a_1, a_2, \ldots, a_T \geq 0$$. Then*
$$
\sum_{t=1}^T \min \Big\{\frac{a_t^2}{\sqrt{\sum_{s=1}^{t-1} a_s^2}}, C a_t\Big\}
\leq 3.5 \sqrt{\sum_{t=1}^T a_t^2}
+ 3.5 C \max_{t \in \{1,\ldots,T\}} a_t.
$$
Applying this with $$a_t = \|\grad_t\|$$ and $$C = 2\Dmax^*/\Done$$, we find that
$$
\begin{align}
R_T(\u)
&\leq \frac{3}{2} \Done \sqrt{\sum_{t=1}^T \|\grad_t\|^2}
+ \frac{\Done}{2}
\Big(
3.5 \sqrt{\sum_{t=1}^T \|\grad_t\|^2}
+ 7 \frac{\Dmax}{\Done} \Gmax
\Big)\notag\\
&= \frac{13}{4} \Done \sqrt{\sum_{t=1}^T \|\grad_t\|^2}
+ \frac{7}{2} \Dmax \Gmax.\label{eqn:refinedbound}
\end{align}
$$
This is great, because we tuned the step sizes without prior knowledge
of the gradient lengths, so we got Lipschitz-adaptivity, and the leading
part of the bound also scales with $$\Done$$ as desired! The fact that
$$\Dmax$$ does appear in the constant term at the end is also fine as
long as our domain is not exceedingly large.[^unbounded] The only
downside is that our leading constant has increased by slightly more
than a factor of $$2$$ from $$3/2$$ to $$13/4$$. Are we being too greedy
if we complain about such a minor remaining imperfection? Certainly not,
because even this increase can be avoided using the second solution
presented below!
[^unbounded]: Orabona and Pál even provide a refined analysis for unbounded domains, which replaces the dependence on $$\Dmax^*$$ by a term of order $$\sqrt{T}$$.
### Solution 2: Cutkosky Clipping
The second solution does not require any improvement to the analysis of
Theorem 2. Instead, we will combine $$\eqref{eqn:ftrlbound}$$ with a
technique to clip the observed gradients, which we call **Cutkosky
clipping**. I first learned about it from Ashok Cutkosky on the bus at
[COLT 2018](http://learningtheory.org/colt2018/), and he later published
it in a paper about [combining Lipschitz adaptivity with unbounded
domains](http://proceedings.mlr.press/v99/cutkosky19a.html). The
technique is both amazingly simple and super useful to have in your
toolbox at the same time, and I remember being blown away that such a
simple idea had not been discovered before.
To explain the approach, let $$G_t = \max_{s \leq t} \|\grad_s\|$$ with
boundary case $$G_0 = 0$$, and define the clipped gradients
$$
\clipgrad_t = \frac{G_{t-1}}{G_t} \grad_t
=
\begin{cases}
\grad_t & \text{if $\|\grad_t\| \leq G_{t-1}$}\\
G_{t-1} \frac{\grad_t}{\|\grad_t\|} & \text{otherwise.}
\end{cases}
$$
These are just the original gradients, but with their lengths truncated
to the previous largest observed length $$G_{t-1}$$. The result is that,
for $$\clipgrad_t$$, we know one time-step in advance what its maximum
length is going to be.
How does this help us? Well, many online learning algorithms, including
the FTRL version of OGD, are based on linear approximations $$\tf_t(\w)
= \w^\top \grad_t$$ of the losses, which enter in the analysis as
$$
f_t(\w_t) - f_t(\u) \leq \tf_t(\w_t) - \tf_t(\u)
\qquad \text{for any convex $f_t$}.
$$
Summing over $$t$$, it follows that
$$
R_T(\u) \leq \tR_T(\u) := \sum_{t=1}^T \big(\tf_t(\w_t) - \tf_t(\u)\big),
$$
and all the bounds from Theorems 1 and 2 are also upper bounds on
$$\tR_T(\u)$$. Moreover, these bounds do not make any assumptions about
where the vectors $$\grad_t$$ are coming from, as long as the same
vectors are used consistently in the algorithms and their analysis. But
this means that we can also apply the algorithms and bounds to the
clipped gradients $$\clipgrad_t$$. Keeping track of the approximation
error when replacing $$\grad_t$$ by $$\clipgrad_t$$,
inequality $$\eqref{eqn:ftrlbound}$$ for the FTRL version of OGD then gives us
$$
\begin{align*}
R_T(\u)
&\leq \tR_T(\u)
= \sum_{t=1}^T \Big(\w_t^\top \clipgrad_t - \u^\top \clipgrad_t\Big)
+ \sum_{t=1}^T (\w_t - \u)^\top (\grad_t - \clipgrad_t)\\
&\leq \frac{\|\w_1 - \u\|^2}{2\eta_{T-1}} + \frac{1}{2} \sum_{t=1}^T
\eta_{t-1} \|\clipgrad_t\|^2
+ \sum_{t=1}^T (\w_t - \u)^\top (\grad_t - \clipgrad_t).
\end{align*}
$$
Since the lengths of the clipped gradients satisfy $$\|\clipgrad_t\|
\leq G_{t-1}$$, we can tune the step sizes as
$$
\begin{equation}\label{eqn:clipstep}
\eta_t = \frac{D_1}{\sqrt{G_t^2 + \sum_{s=1}^t \|\clipgrad_s\|^2}}
\end{equation}
$$
to get
$$
\frac{\|\w_1 - \u\|^2}{2\eta_{T-1}} + \frac{1}{2} \sum_{t=1}^T
\eta_{t-1} \|\clipgrad_t\|^2
\leq \frac{3}{2} D_1 \sqrt{G_{T-1}^2 + \sum_{t=1}^{T-1} \|\clipgrad_t\|^2}
\leq \frac{3}{2} D_1 \Gmax \sqrt{T}.
$$
It remains to control the overhead introduced by clipping, which can be
done as follows:
$$
\begin{align*}
\sum_{t=1}^T (\w_t - \u)^\top (\grad_t - \clipgrad_t)
&= \sum_{t=1}^T (\w_t - \u)^\top \grad_t \Big(1 - \frac{G_{t-1}}{G_t}\Big)\\
&\leq \Dmax \sum_{t=1}^T \|\grad_t\| \Big(1 - \frac{G_{t-1}}{G_t}\Big)
\leq \Dmax \sum_{t=1}^T G_t \Big(1 - \frac{G_{t-1}}{G_t}\Big)\\
&= \Dmax \sum_{t=1}^T \Big(G_t - G_{t-1}\Big)
= \Dmax \Gmax.
\end{align*}
$$
Putting everything together, we find that:
**Theorem 4:** *Running the FTRL version of OGD on the clipped gradients
$$\clipgrad_t$$ with step sizes as in $$\eqref{eqn:clipstep}$$
guarantees regret at most*
$$
R_T(\u) \leq \frac{3}{2} D_1 \Gmax \sqrt{T} + \Dmax \Gmax.
$$
This is the same as $$\eqref{eqn:refinedbound}$$, but now even
the constant factors turn out as nice as we could hope for!
## Conclusion
We have seen that the FTRL version of OGD is better than the OMD
version, because its regret depends only on the distance
$$\Done$$ of the first iterate from the optimal parameters, as opposed to
the maximum distance over all the iterates $$\Dmax$$ for OMD. Comparing
the standard regret bounds for FTRL and OMD suggests that OMD might
still have the advantage in adapting to the sizes of the gradients, but
it turns out that this can be remedied for FTRL, either by improving the
standard analysis
([Solution 1](#solution-1-improve-the-analysis-of-ftrl)) or by combining
the standard analysis with Cutkosky clipping of the gradients
([Solution 2](#solution-2-cutkosky-clipping)).
### Skeletons in the Closet
There is one topic that I have completely glossed over so far, which is
how to adapt to the parameter $$\Done$$ that still appears in the tuning
of the step sizes. This is actually not a straightforward issue, but
through a long series of works it is by now quite well understood. The
upshot and a selection of key references are as follows:
* As discussed in this post, you can have Lipschitz-adaptivity without
adaptivity to $$\Done$$.
* You can also have adaptivity to $$\Done$$ if you give up on
Lipschitz-adaptivity [[Cutkosky and Orabona, 2018](http://proceedings.mlr.press/v75/cutkosky18a.html)].
* Having both Lipschitz-adaptivity and adaptivity to $$\Done$$ at the
same time is not possible [[Cutkosky and
Boahen, 2017](http://proceedings.mlr.press/v65/cutkosky17a.html)]...
* ... except if you pay an additional cubic additive constant of order
$$O(\Done^3)$$ in the rates...
[[Cutkosky, 2019](http://proceedings.mlr.press/v99/cutkosky19a.html)],
[[Mhammedi and
Koolen, 2020](http://proceedings.mlr.press/v125/mhammedi20a.html)]
* ... or if you make extra structural assumptions about your losses
$$f_t$$ [[Kempka, Kotłowski and Warmuth, 2019](http://proceedings.mlr.press/v97/kempka19a.html)].
That's it. Thanks for reading!Tim van ErvenMany online learning algorithms come in two flavors: you can either choose the online mirror descent (OMD) version or the follow-the-regularized-leader (FTRL) = dual-averaging version. In practice people usually go for OMD, which makes sense from a theoretical point of view because it can easily adapt to the size of the gradients, i.e. it is Lipschitz-adaptive. This comes with a big downside, however, because OMD may perform (very!) poorly when your intermediate iterates happen to wander away from the optimal parameters for a while, as may easily happen with non-stationary data. FTRL, on the other hand, is much more robust to non-stationarity, but it is not Lipschitz-adaptive by default according to its standard analysis. What should we do? It turns out there are actually two easy fixes to get the best of both worlds, which both build off of FTRL: the first fix is to use a refinement of the standard analysis due to Orabona and Pál to show that FTRL can be made Lipschitz-adaptive after all; and the second fix is to combine FTRL with a technique that we have started calling Cutkosky clipping in my group, after its inventor Ashok Cutkosky. Read on for an overview of the two approaches, which are both worth having in your theoretical toolbox.Online Learning: How to Filter Out Outliers2021-08-26T00:00:00+02:002021-08-26T00:00:00+02:00https://www.timvanerven.nl/blog/robust-online-learningFor the [latest version of the MetaGrad
algorithm](https://jmlr.org/papers/v22/20-1444.html) {% cite VanErvenKoolenVanDerHoeven2021 %}, we did some pretty
extensive experiments, and this really drove home the point with me that
online learning algorithms only work well in practice if they can
automatically adapt to the size of the gradients, as opposed to fixing
an upper bound $$G$$ on the maximum gradient length in advance. This
probably also explains the success of
[AdaGrad](https://jmlr.org/papers/v12/duchi11a.html) {% cite DuchiHazanSinger2011 %},
[AdaHedge](http://jmlr.org/papers/v15/rooij14a.html) {% cite DeRooijVanErvenGrunwaldKoolen2014 %}, etc.
But if you do not enforce a bound on the gradient lengths, then you make
yourself vulnerable to outliers: a single extreme data point (e.g. with
a large gradient) can completely break the online learner. How do we
defend against such outliers? In this post, I describe a simple (but
optimal) filtering strategy that can be applied to any existing online
learning algorithm. This is based on our recent [COLT
paper](http://proceedings.mlr.press/v134/vanerven21a/vanerven21a.pdf) {%
cite VanErvenSachsKoolenKotlowski2021 %}
with Sarah Sachs, Wouter Koolen and Wojciech Kotłowski. There are also
the [COLT recorded
videos](http://www.learningtheory.org/colt2021/virtual/poster_1390.html)
if you prefer.
<!--more-->
{% include figure image_path="/assets/blog/2021/EffectOfOutlier.jpg"
alt="this is a placeholder image" caption="A single outlier can
completely change the decision boundary learned by (offline) logistic
regression. The same problem occurs in online learning." %}
## Motivation: Reasons for Outliers
Before getting into the technical side of things, it is helpful to
consider several reasons why outliers might occur:
1. **Naturally Heavy-tailed Data**: The distribution of your data might be
heavy-tailed, meaning that extreme data points occur naturally with
small probability.
2. **Malicious Users**: If each data-point corresponds to a different
user, then a small number of users may be trying to poison the data
stream by deliberately generating adversarial inputs. Think, e.g., of
spammers generating malformatted e-mails.
3. **Sensor Glitches**: With the increasing prominence of cheap sensors,
measurement errors can occasionally generate extreme inputs.
To deal with all such cases simultaneously, we want to avoid making any
assumptions about the nature of the outliers. This motivates the
game-theoretic approach described below.
$$
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator{\argmin}{arg\,min}
\DeclareMathOperator{\diameter}{diameter}
\DeclareMathOperator{\risk}{Risk}
\newcommand{\u}{\boldsymbol u}
\newcommand{\w}{\boldsymbol w}
\newcommand{\grad}{\boldsymbol g}
\newcommand{\domain}{\mathcal{W}}
\newcommand{\reals}{\mathbb{R}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\gradlist}{\mathcal{L}}
$$
## The Robust Regret Setting
Let's start by fixing the setting, which will be a variant of the
standard online convex optimization framework: Each round
$$t=1,\ldots,T$$, the learner predicts a point $$\w_t$$ from a convex
domain $$\domain \subset \reals^d$$ of diameter at most
$$\diameter(\domain) \leq D$$. The
environment then selects a convex loss function $$f_t : \domain \to \reals$$
and the learner incurs loss $$f_t(\w_t)$$. Most online learning algorithms
update their predictions efficiently, using only the gradient $$\grad_t
= \nabla f_t(\w_t)$$, and the standard objective is for the learner to
control the regret
$$
R_T(\u) = \sum_{t=1}^T \big(f_t(\w_t) - f_t(\u)\big)
\qquad \textrm{(standard regret)}
$$
for all $$\u \in \domain$$ simultaneously.
Suppose now that only a subset of rounds $$\cS \subseteq [T] :=
\{1,\ldots,T\}$$ consist of reliable inliers and the other rounds $$[T]
\setminus \cS$$ are all outliers that our learning algorithm should
ignore. Then the objective will be to control the **robust regret**:
$$
R_T(\u,\cS) = \sum_{t \in \cS} \big(f_t(\w_t) - f_t(\u)\big)
\qquad \textrm{(robust regret)}.
$$
Note the difference, which is that we are measuring performance only on **inlier
rounds in $$\cS$$**.
### Main Challenges
The robust regret setup creates the following challenges:
1. $$\cS$$ is unknown, and may in fact be chosen
adversarially to maximally confuse the learner.
2. We will assume that the maximum length $$G(\cS) = \max_{t \in \cS}
\|\grad_t\|$$ of the inlier gradients is reasonable, but we make no
assumptions whatsoever about the (lengths of) outlier gradients
$$\grad_t$$ for $$t \not \in \cS$$.
Since we do not assume that outliers behave reasonably in any way, it
follows that bounds on the robust regret can only depend on the scale
$$G(\cS)$$ of the inliers, but **not on the lengths of the outliers!**
## Approach: Top-k Filtering
So what can we do? We will take a very general approach, in which we
modify any existing online learner ALG by filtering out some rounds with
extreme gradients. By filtering a round, I mean that we pretend to ALG
that the round did not happen, so we do not feed it any gradient and
ALG's prediction remains unchanged.
We will assume that there are at most $$k$$ outliers,
so
$$
T - |\cS| \leq k.
$$
Then the filtering strategy, which we call **top-k filter**, works as
follows:
* Maintain a list $$\gradlist_t$$ of the $$k+1$$ largest gradient lengths
seen during rounds $$1,\ldots,t$$.
* Filter out round $$t$$ if $$\|\grad_t\| \geq 2 \min \gradlist_t$$;
otherwise pass the round on to ALG.
Note the factor 2 in front of $$\min \gradlist_t$$ in the filtering
threshold!
## Analysis: Are We Happy?
One thing to like about top-k filter is that it is a nice and simple
approach, but of course true happiness can only be found if its
performance is any good. In the paper, we first show the following
general guarantee for top-k filtering when the losses are linear, i.e.
$$f_t(\w) = \w^\top \grad_t$$.
**Theorem 1** (Linear Losses). *Suppose losses are linear, and
ALG achieves regret bound $$B_T(G)$$ when running on at most $$T$$
rounds with gradients of length at most $$G$$. Then ALG + top-k filter
guarantees that the robust regret does not exceed*
$$
R_T(\u,\cS) \leq B_T(2 G(\cS)) + 4 D G(\cS)(k+1)
\qquad
\textrm{for any $\cS: T - |\cS| \leq k$.}
$$
See [below](#proof-ideas-behind-theorem-1) for the main ideas behind the
proof.
The first term, $$B_T(2 G(\cS))$$, comes from feeding ALG gradients of
length at most $$2 G(\cS)$$, which it turns out is guaranteed by top-k
filter. The second term of order $$O(G(\cS) k)$$ is the price we pay
for achieving robustness. Note that it scales with the size of the
inlier gradients $$G(\cS)$$ only, and does not depend on the outliers at
all!
At first sight, it might appear restrictive that the bound only applies
to linear losses, but we can easily get around this using a standard
reduction from general convex losses to the linear case via the
inequality $$f_t(\w_t) - f_t(\u) \leq (\w_t - \u)^\top \grad_t$$. There
exists a similar reduction for so-called strongly convex losses. By
instantiating ALG to be the standard online gradient descent algorithm,
this leads to the following guarantees on the robust regret:
| Losses | Minimax Standard Regret | Minimax Robust Regret |
|:------------------------|:-----------------------:|:---------------------:|
| General Convex Losses | $$O(\sqrt{T})$$ | $$O(\sqrt{T} + k)$$ |
| Strongly Convex Losses | $$O(\ln(T))$$ | $$O(\ln(T) + k)$$ |
These rates turn out to be optimal, because we prove matching lower bounds
both for general convex losses and for strongly convex losses. In fact,
for general convex losses the lower bound construction uses independent,
identically distributed (i.i.d.) losses, so even if we assume that the
losses are coming from a nice fixed distribution, then our upper bound
is still optimal. We see that the optimal price of robustness is an additive $$O(k)$$ in quite some generality, and top-k filter achieves it.
So are we happy? Well, the pursuit of happiness is a fool's errant. We
should rather aim to live a fulfilling life, but certainly there is
nothing more fulfilling than optimality!
### Proof Ideas Behind Theorem 1
The main ideas behind top-k filtering and Theorem 1 are as follows:
**1. Top-k filter ensures that we never pass ALG any gradients whose length exceeds $$2 G(\cS)$$.**
To see this, note that:
* $$\gradlist_t$$ must contain at least one inlier, because it has
length $$k+1$$ and there are at most $$k$$ outliers.
* Hence $$\min \gradlist_t \leq G(\cS)$$, and our filtering threshold is
$$2\min \gradlist_t$$.
**2. The overhead for filtering is $$O(k)$$.**
We can account for incorrectly filtered rounds $$t \in \cS$$ as follows:
* Every filtered round must have a large gradient, whose length is added
to $$\gradlist_t$$.
* By the factor 2 in the filtering threshold, $$\min \gradlist_t$$
therefore (at least) doubles every $$k+1$$ filtered rounds.
* Because of this exponential growth of $$\min \gradlist_t$$, the last
$$k+1$$ incorrectly filtered rounds dominate in the bound.
## Robustified Online-to-Batch Conversion for Huber ε-Contamination
There is one more result from the paper that I would like to highlight,
which is that the robust regret can be used for robust learning in the
Huber ε-contamination setting via a robustified variant of the usual
online-to-batch conversion trick.
In the Huber ε-contamination model, our losses are of the form $$f_t(\w)
= f(\w,\xi_t)$$, where the underlying data $$\xi_1,\ldots,\xi_T$$ are
sampled i.i.d. from a mixture distribution $$P_\epsilon$$:
$$
P_\epsilon = (1-\epsilon) P + \epsilon Q.
$$
The interpretation is that observations from the distribution of
interest $$P$$ are corrupted by observations from some unknown outlier
distribution $$Q$$ with probability $$\epsilon$$. This is a
batch-learning setting, in which the goal is to output a single set of
parameters $$\bar \w_T$$ that achieve small risk under $$P$$:
$$
\risk_P(\w) = \E_{\xi \sim P}[f(\w,\xi)].
$$
When $$\epsilon = 0$$ (i.e., there are no outliers), the standard
[online-to-batch conversion
approach](https://ieeexplore.ieee.org/abstract/document/1327806) {% cite CesaBianchiConconiGentile2004 %} allows
us to average the predictions $$\w_1,\ldots,\w_T$$ of any online
learning algorithm according to
$$
\bar \w_T = \frac{1}{T} \sum_{t=1}^T \w_t
$$
with the guarantee that
$$
\E_P[\risk_P(\bar \w_T) - \risk_P(\u_P)]
\leq \frac{\E_P[R_T(\u_P)]}{T},
$$
where $$\u_P = \argmin_{\u \in \domain} \risk_P(\u)$$.
It turns out that this can be generalized to the case $$\epsilon > 0$$,
in which there are outliers, if we replace the standard regret by the robust regret:
$$
\E_{P_\epsilon}[\risk_P(\bar \w_T) - \risk_P(\u_P)]
\leq \frac{\E_{P_\epsilon}[R_T(\u_P,\cS^*)]}{(1-\epsilon)T},
$$
where the inliers $$\cS^*$$ are those rounds in which we receive
a sample from the distribution of interest $$P$$. Notice that now the
expectations are under the mixture distribution $$P_\epsilon$$, but the
risk is measured with respect to $$P$$! If the losses or gradients of
the inliers are bounded $$P$$-almost surely, then an analogous result
also holds with high probability.
For instance, we show the following corollary, which achieves rates
for Huber ε-contamination that are optimal under its assumptions:
**Corollary.** *Suppose $$\|\nabla f(\w,\xi)\| \leq G$$ when $$\xi \sim
P$$ is an inlier. Then online gradient descent with top-k filtering
achieves*
$$
\risk_P(\bar \w_T) - \risk_P(\u_P)
= O\Big(DG \epsilon + DG \sqrt{\frac{\ln(1/\delta)}{T}}\Big)
$$
*with $$P_\epsilon$$-probability at least $$1-\delta$$, for $$k$$ tuned
suitably depending on $$\epsilon,T$$ and $$\delta \in (0,1]$$.*
## Conclusion
In this post, I have described a filtering approach that provides a way
to robustify any online learning algorithm ALG by filtering out large
gradients using top-k filter. This approach turns out to be optimal in
the new robust regret setting, which further allows a type of
robustified online-to-batch conversion for the Huber ε-contamination
model.
## References
{% bibliography --cited %}Tim van ErvenFor the latest version of the MetaGrad algorithm [1], we did some pretty extensive experiments, and this really drove home the point with me that online learning algorithms only work well in practice if they can automatically adapt to the size of the gradients, as opposed to fixing an upper bound \(G\) on the maximum gradient length in advance. This probably also explains the success of AdaGrad [2], AdaHedge [3], etc. But if you do not enforce a bound on the gradient lengths, then you make yourself vulnerable to outliers: a single extreme data point (e.g. with a large gradient) can completely break the online learner. How do we defend against such outliers? In this post, I describe a simple (but optimal) filtering strategy that can be applied to any existing online learning algorithm. This is based on our recent COLT paper [4] with Sarah Sachs, Wouter Koolen and Wojciech Kotłowski. There are also the COLT recorded videos if you prefer.PAC-Bayes Mini-tutorial: A Continuous Union Bound2013-12-18T00:00:00+01:002013-12-18T00:00:00+01:00https://www.timvanerven.nl/blog/pac-bayes-mini-tutorial-a-continuous-union-boundWhen I first encountered PAC-Bayesian concentration inequalities they
seemed to me to be rather disconnected from good old-fashioned results
like Hoeffding's and Bernstein's inequalities. But, at least for one
flavour of the PAC-Bayesian bounds, there is actually a very close
relation, and the main innovation is a continuous version of the union
bound, along with some ingenious applications. Here's the gist of what's
going on, presented from a machine learning perspective.
<!--more-->
NB On the website I can only provide a high-level sneak preview of the
tutorial for brevity. For the full version, which is still quite short,
but includes the proofs and fills in the missing details, see
[PAC-BayesMiniTutorial.pdf](/assets/blog/2013/pac-bayes-mini-tutorial-a-continuous-union-bound/PAC-BayesMiniTutorial.pdf).
$$
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator*{\argmin}{arg\,min}
$$
## The Cramér-Chernoff Method
I will start by outlining the Cramér-Chernoff method, from which
Hoeffding's and Bernstein's inequalities and many others follow. This
method is incredibly well explained in Appendix A of the textbook by
Cesa-Bianchi and Lugosi {% cite CesaBianchiLugosi2006 %}, but I will
have to change the presentation a little to easily connect with the
PAC-Bayesian bounds later on.
Let $$D =((X_1,Y_1),\ldots,(X_n,Y_n))$$ be *independent, identically
distributed* (i.i.d.) examples, and let $$h$$ be a
*hypothesis* from a set of hypotheses $$\mathcal{H}$$, which gets
loss $$\ell(X_i,Y_i,h)$$ on the $$i$$-th example. For example, we might
think of the squared loss $$\ell(X_i,Y_i,h) = (Y_i - h(X_i))^2$$. We also
define the *empirical error* [^risk] of $$h$$
[^risk]: Called the empirical *risk* in statistics; hence the notation with `R'.
$$R_n(D,h) = \frac{1}{n} \sum_{i=1}^n \ell(X_i,Y_i,h),$$
and our goal is to prove that the empirical error is close to the *generalisation error*
$$R(h) = \E[\ell(X,Y,h)]$$
with high probability. To do this, we define the function
$$M_\eta(h)
= -\tfrac{1}{\eta} \ln \E\Big[e^{-\eta \ell(X,Y,h)}\Big]
\qquad \text{for $\eta > 0$,}$$
which will act as a surrogate for $$R(h)$$. Now the Cramér-Chernoff method
tells us that:
**Lemma 1.** *For any $$\eta > 0$$, $$\delta \in (0,1]$$,
\begin{equation}\label{eqn:chernoff}
M_\eta(h) \leq R_n(h,D) + \frac{1}{\eta n}\ln \frac{1}{\delta}
\end{equation}with probability at least $$1-\delta$$.*
Many standard concentration inequalities can then be derived by first relating $$M_\eta(h)$$ to $$R(h)$$, and then optimizing $$\eta$$. This includes, for example, Hoeffding's inequality and inequalities involving the variance like Bernstein's inequality. See the <a href="/assets/blog/2013/pac-bayes-mini-tutorial-a-continuous-union-bound/PAC-BayesMiniTutorial.pdf">full version</a> of this post for details.
## The Union Bound
Now suppose we use an estimator $$\hat{h} \equiv \hat{h}(D) \in \mathcal{H}$$ to pick a hypothesis based on the data, for example using empirical risk minimization: $$\hat{h} = \argmin_{h \in \mathcal{H}} R_n(D,h)$$. To get a bound for $$\hat{h}$$ instead of a fixed $$h$$, we want \eqref{eqn:chernoff} to hold for all $$h \in \mathcal{H}$$ simultaneously. If $$\mathcal{H}$$ is countable, this can be done using the union bound, which leads to:
<strong>Lemma 4.</strong> Suppose $$\mathcal{H}$$ is countable. For $$h \in \mathcal{H}$$, let $$\pi(h)$$ be any numbers such that $$\pi(h) \geq 0$$ and $$\sum_h \pi(h)= 1$$. Then, for any $$\eta > 0$$, $$\delta \in (0,1]$$, \begin{equation}
M_\eta(\hat{h}) \leq R_n(D,\hat{h}) + \frac{1}{\eta n}\ln
\frac{1}{\pi(\hat{h})\delta}
\end{equation} with probability at least $$1-\delta$$.
In this context, the function $$\pi$$ is often referred to as a *prior distribution*, even though it need not have anything to do with prior beliefs.
Just like for Lemma 1, we can then again relate $$M_\eta(h)$$ to $$R(h)$$ to obtain a bound on the generalisation error. This shows, in a nutshell, how one can combine the Cramér-Chernoff method with the union bound to obtain concentration inequalities for estimators $$\hat{h}$$. The use of the union bound, however, is quite crude when there are multiple hypotheses in $$\mathcal{H}$$ with very similar losses, and the current proof breaks down completely if we want to extend it to continuous classes $$\mathcal{H}$$. This is where PAC-Bayesian bounds come to the rescue: in the next section I will explain the PAC-Bayesian generalisation of Lemma 4 to continuous hypothesis classes $$\mathcal{H}$$, which will require replacing $$\hat{h}$$ by a randomized estimator.
## PAC-Bayesian Concentration
Let $$\hat{\pi} \equiv \hat{\pi}(D)$$ be a distribution on $$\mathcal{H}$$ that depends on the data $$D$$, which we will interpret as a randomized estimator: instead of choosing $$\hat{h}$$ deterministically, we will sample $$h \sim \hat{\pi}$$ randomly. The distribution $$\hat{\pi}$$ is often called the PAC-Bayesian *posterior distribution*. Now the result that the PAC-Bayesians have, may be expressed as follows:
<strong>Lemma 6.</strong> Let $$\pi$$ be a (prior) distribution on $$\mathcal{H}$$ that does not depend on $$D$$, and let $$\hat{\pi}$$ be a randomized estimator that is allowed to depend on $$D$$. Then, for any $$\eta > 0$$, $$\delta \in (0,1]$$, \begin{equation}\label{eqn:pacbayes}
\E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_{h \sim \hat{\pi}}[R_n(D,h)] +
\frac{1}{\eta n}\Big(D(\hat{\pi}\|\pi) + \ln \frac{1}{\delta}\Big)
\end{equation} with probability at least $$1-\delta$$. Moreover, \begin{equation}\label{eqn:pacbayesexp}
\E_D \E_{h \sim \hat{\pi}}[M_\eta(h)] \leq \E_D\Big[ \E_{h \sim \hat{\pi}}[R_n(D,h)] +
\frac{1}{\eta n}D(\hat{\pi}\|\pi)\Big].
\end{equation}
Here $$D(\hat{\pi}\|\pi) = \int \hat{\pi}(h) \ln \frac{\hat{\pi}(h)}{\pi(h)} \mathrm{d} h$$ denotes the Kullback-Leibler divergence of $$\hat{\pi}$$ from $$\pi$$.
To see that Lemma 6 generalises Lemma 4, suppose that $$\hat{\pi}$$ is a point-mass on $$\hat{h}$$. Then $$D(\hat{\pi}\|\pi) = \ln (1/\pi(\hat{h}))$$, and we recover Lemma 4 as a special case of \eqref{eqn:pacbayes}. An important difference with Lemma 4, however, is that Lemma 6 does not require $$\mathcal{H}$$ to be countable, and in fact in many PAC-Bayesian applications it is not.
## Keep Reading...
We have seen how PAC-Bayesian inequalities naturally extend standard concentration inequalities based on the Cramér-Chernoff method by generalising the union bound to a continuous version. I'm afraid that's all that will reasonably fit into a single blog post on my website. If you want more, simply continue with the <a href="/assets/blog/2013/pac-bayes-mini-tutorial-a-continuous-union-bound/PAC-BayesMiniTutorial.pdf">full version</a>, which covers several more issues, including:
* How to relate $$M_\eta(h)$$ to $$R(h)$$ to obtain, for example, Hoeffding's inequality.
* How to optimize $$\eta$$, which comes "for free" in Lemma 4, but requires an additional union bound in the general PAC-Bayesian case of Lemma 6.
* How the PAC-Bayesians choose their prior $$\pi$$ and posterior $$\hat{\pi}$$. (There's an unexpected trick called *localisation*.)
* Proofs
* Literature references
## References
{% bibliography --cited %}Tim van ErvenWhen I first encountered PAC-Bayesian concentration inequalities they seemed to me to be rather disconnected from good old-fashioned results like Hoeffding’s and Bernstein’s inequalities. But, at least for one flavour of the PAC-Bayesian bounds, there is actually a very close relation, and the main innovation is a continuous version of the union bound, along with some ingenious applications. Here’s the gist of what’s going on, presented from a machine learning perspective.From Exp-concavity to Mixability2012-12-11T00:00:00+01:002012-12-11T00:00:00+01:00https://www.timvanerven.nl/blog/from-exp-concavity-to-mixabilityIn sequential prediction (online learning) with expert advice the goal
is to predict a sequence of outcomes almost as well as the best advisor
from a pool of experts. The quality of predictions is measured by a
<em>loss function</em>, which is determined by the application one has
in mind. For most loss functions, the best performance that can be
guaranteed is to be within $$O(\sqrt{T})$$ of the best expert on $$T$$
outcomes, but for some special loss functions $$O(1)$$ overhead is
possible.
In the 1990's, people familiar with the work of Vovk called these
special loss functions <em>mixable losses</em>, but nowadays the notion
of mixability appears to be mostly forgotten, and the geometric concept
of <em>exp-concavity</em> has taken its place. This raises the question
of how the two are related, which strangely does not appear to be
answered in very much detail in the literature. As I have been studying
mixability quite a bit in my recent work, I was wondering about this, so
here are some thoughts. <strong>Update:</strong> In particular, I will
construct a parameterization of the squared loss in which it is
$$1/2$$-exp-concave instead of only $$1/8$$-exp-concave like in its usual
parameterization.
<!--more-->
This post is also available as <a href="/assets/blog/2012/from-exp-concavity-to-mixability/FromExpConcavityToMixability.pdf">FromExpConcavityToMixability.pdf</a>.
## Mixability and Exp-concavity
Suppose we predict an outcome $$y \in \mathcal{Y}$$ by specifying a
prediction $$a \in \mathcal{A}$$. The better our prediction, the smaller
our loss $$\ell(y,a)$$. (I will assume $$\ell(y,a)$$ is nonnegative, but
that does not really matter.) For example, if $$y$$ and $$a$$ both take
values in $$\{0,1\}$$, then the <em>$$0/1$$-loss</em> $$\ell(y,a) = |y-a|$$ is
$$0$$ if we predict correctly and $$1$$ otherwise. Alternatively, if $$y$$ and
$$a$$ are both real-valued, then the <em>squared loss</em> is $$\ell(y,a) =
(y-a)^2$$. And finally, if $$a$$ specifies a probability density $$f_a$$ on
$$\mathcal{Y}$$, then our loss may be the <em>log loss</em> $$\ell(y,a) =
-\ln f_a(y)$$.
### Mixability
For $$\eta > 0$$, a loss function is called <em>$$\eta$$-mixable</em> {% cite CesaBianchiLugosi2006 %} if for
any probability distribution $$\pi$$ on $$\mathcal{A}$$ there exists a
prediction $$a_\pi \in \mathcal{A}$$ such that
$$
\begin{equation}\label{eqn:mixable}
e^{-\eta \ell(y,a_\pi)}
\geq \int e^{-\eta \ell(y,a)} \pi(\mathrm{d} a)
\qquad \text{for all $y \in \mathcal{Y}$.}
\end{equation}
$$
The constant in the $$O(1)$$ overhead compared to the best expert is
proportional to $$1/\eta$$, so the bigger $$\eta$$ the better.
### Exp-concavity
For $$\eta > 0$$, a loss function is called $$\eta$$-exp-concave if for any
distribution $$\pi$$ on $$\mathcal{A}$$ the prediction $$a_\pi = \int a
\,\pi(\mathrm{d} a)$$ satisfies \eqref{eqn:mixable}. So exp-concavity is just mixability with $$a_\pi$$ fixed to be the mean.
This choice is appropriate in the case of log loss. In this case, for
$$\eta = 1$$, the numbers $$e^{-\eta \ell(y,a)} = f_a(y)$$ just equal
probability densities and \eqref{eqn:mixable} holds with equality. For squared loss, however, the appropriate choice for $$a_\pi$$ is not the
mean. Suppose that $$y$$ and $$a$$ both take values in $$[-1,+1]$$. Then,
while the squared loss is $$1/2$$-mixable for $$a_\pi = \frac{h_{1/2}(-1) -
h_{1/2}(1)}{4}$$ with $$h_\eta(y) = \frac{-1}{\eta} \ln \int e^{-\eta
(y-a)^2} \pi(\mathrm{d} a)$$, it is only $$1/8$$-exp-concave when parameterized
by $$a$$ {% cite HausslerKivinenWarmuth1998 %}, {% cite Vovk2001 %}. This does not
rule out, however, that the squared loss might be $$1/2$$-exp-concave in a
different parameterization. As we shall see, such a parameterization
indeed exists <del datetime="2013-12-03T10:21:39+00:00">if we restrict $$y$$ to take only two values $$\{-1,+1\}$$,
but I have not been able to find a suitable reparameterization in
general</del>.
## Relations
Clearly, exp-concavity implies mixability: it just makes the choice for
$$a_\pi$$ explicit. What is not so obvious, is when the implication also
goes the other way. It turns out that in some cases it actually does
if we reparameterize our predictions in a clever (one
might also say: complicated) way by the elements of a certain set
$$\mathcal{B}_\eta$$.
<strong>Theorem</strong>
<em>Suppose a loss $$\ell \colon \mathcal{Y} \times \mathcal{A} \to
[0,\infty]$$ satisfies Conditions 1 and
2 below for some $$\eta > 0$$. Then $$\ell$$ is $$\eta$$-mixable
if and only if it can be parameterized in such a way that it is
$$\eta$$-exp-concave.</em>
The technical conditions I need are the following:
1. All predictions in $$\mathcal{A}$$ should be
admissible.
2. For any element $$g$$ on the north-east boundary of the set
$$\mathcal{B}_\eta$$, there should exist a prediction $$a \in \mathcal{A}$$
such that $$g(y) = e^{-\eta \ell(y,a)}$$ for all $$y$$.
It remains to explain what these conditions mean, and discuss their
severity. I will argue that Condition 1 is very
mild. Condition 2 also appears to be generally satisfied
if the dimensionality of the set of predictions equals the number of
possible predictions minus one, i.e. $$\dim(\mathcal{A}) =
|\mathcal{Y}|-1$$, but not in general. For example, for the squared loss
we predict by a single number $$a$$, so $$\dim(\mathcal{A}) = 1$$ and hence
we have $$\dim(\mathcal{A}) = |\mathcal{Y}|-1$$ if $$y$$ only takes two
different values, but not if $$\mathcal{Y}$$ is the whole range $$[-1,+1]$$. <strong>Update:</strong> We can work around this, though. See below.
## The Technical Conditions
### Admissibility
Condition 1 is the easiest of the two. I will call a
prediction $$a \in \mathcal{A}$$ <em>admissible</em> if there exists no
other prediction $$b \in \mathcal{A}$$ that is always at least as good in
the sense that $$\ell(y,b) \leq \ell(y,a)$$ for all $$y \in \mathcal{Y}$$.
If $$a$$ is inadmissible, then we could just remove it from the set of
available predictions $$\mathcal{A}$$, because predicting $$b$$ is always at
least as good anyway. So admissibility seems more of an administrative
requirement (get rid of all predictions that make no sense) than a real
restriction.
### Condition 2
To explain the second condition, we define the new parameterization
$$\mathcal{B}_\eta$$ as the set of functions
$$
\begin{equation*}
\mathcal{B}_\eta = \{g \colon \mathcal{Y} \to [0,1] \mid \text{for some
distribution $\pi$: } g(y) = \int e^{-\eta \ell(y,a)} \pi(\mathrm{d} a)\
\forall y\}.
\end{equation*}
$$
Note that the set $$\mathcal{B}_\eta$$ is convex by construction.
Let $$\mathbb{1}(y) = 1$$ be the constant function that is $$1$$ on all $$y \in
\mathcal{Y}$$, and for any $$g \in \mathcal{B}_\eta$$ let $$c(g) = \sup \{c
\geq 0 \mid (g + c \cdot \mathbb{1}) \in \mathcal{B}_\eta\}$$. By the
<em>north-east boundary</em> of $$\mathcal{B}_\eta$$, I mean the set of points
$$\{g + c(g) \mid g \in \mathcal{B}_\eta \}$$. That is, if we move
\`south-east\' from any point in this set (in the direction of $$-\mathbb{1}$$),
we are inside $$\mathcal{B}_\eta$$, but if we move further \`north-east\' (in
the direction of $$\mathbb{1}$$) we are outside.
Condition 2 implies that the north-east boundary of
$$\mathcal{B}_\eta$$ should be equal to the set $$\{e^{-\eta \ell(\cdot,a)}
\mid a \in \mathcal{A}\}$$, which appears to be quite typical if $$\dim(A)
= |\mathcal{Y}| - 1$$, but not in general.
## Construction of the Parameterization and Proof
As we have already seen that $$\eta$$-exp-concavity trivially implies
$$\eta$$-mixability, it remains to construct the parameterization in which
$$\ell$$ is $$\eta$$-exp-concave given that it is $$\eta$$-mixable.
The parameterization we choose is indexed by the elements of
$$\mathcal{B}_\eta$$, which we map onto $$\mathcal{A}$$, with multiple
elements in $$\mathcal{B}_\eta$$ mapping to the same element of
$$\mathcal{A}$$. So let $$g$$ be an arbitrary element of $$\mathcal{B}_\eta$$.
How do we map it to a prediction $$a \in \mathcal{A}$$? We do this by
choosing the prediction $$a$$ such that $$g(y) + c(g) = e^{-\eta
\ell(y,a)}$$ for all $$y$$. As $$g + c(g)\cdot \mathbb{1}$$ lies on the north-east
boundary of $$\mathcal{B}_\eta$$, such a prediction exists by
Condition 2.
Our construction ensures there exists a $$g \in \mathcal{B}_\eta$$ that
maps to $$a$$ for any $$a \in \mathcal{A}$$. To see this, suppose there was
an $$a$$ for which this was not the case, and let $$g_a = e^{-\eta
\ell(\cdot,a)}$$. Then we must have $$c(g_a) > 0$$, because otherwise we
would have $$c(g) = 0$$ and $$g_a$$ would map to $$a$$. But then the
prediction $$b \in \mathcal{A}$$ such that $$e^{-\eta \ell(\cdot,b)} = g +
c(g) \cdot \mathbb{1}$$ would satisfy $$e^{-\eta \ell(y,b)} > e^{-\eta
\ell(y,a)}$$ for all $$y$$, and hence $$\ell(y,b) < \ell(y,a)$$ for all
$$y$$, so that $$a$$ would be inadmissible, which we have ruled out by
assumption.
We are now ready to prove that the loss is $$\eta$$-exp-concave in our
parameterization. To show this, let $$\pi$$ be an arbitrary probability
distribution on $$\mathcal{B}_\eta$$. Then we need to show that
$$
\begin{equation*}
e^{-\eta \ell(y,g_\pi)}
\geq \int e^{-\eta \ell(y,g)} \pi(\mathrm{d} g)
\qquad \text{for all $y \in \mathcal{Y}$,}
\end{equation*}
$$
where $$g_\pi = \int g\ \pi(\mathrm{d} g)$$. To this end, observe that
$$
\begin{equation*}
\int e^{-\eta \ell(\cdot,g)} \pi(\mathrm{d} g)
= \int (g + c(g)\cdot \mathbb{1})\ \pi(\mathrm{d} g)
= g_\pi + c_\pi\cdot \mathbb{1},
\end{equation*}
$$
where $$c_\pi = \int c(g)\ \pi(\mathrm{d} g)$$. Now convexity of
$$\mathcal{B}_\eta$$ ensures that $$\int e^{-\eta \ell(\cdot,g)} \pi(\mathrm{d} g)
\in \mathcal{B}_\eta$$, so that we must have $$c_\pi \leq c(g_\pi)$$. But
then
$$
\begin{equation*}
e^{-\eta \ell(y,g_\pi)} = g_\pi(y) + c(g_\pi)
\geq g_\pi(y) + c_\pi =
\int e^{-\eta \ell(y,g)} \pi(\mathrm{d} g)
\end{equation*}
$$
for all $$y$$, which was to be shown.
## Squared Loss
So how do things play out for the squared loss? We know that it is
$$1/2$$-mixable, so we would like to find a parameterization in which it
is also $$1/2$$-exp-concave. Suppose first that $$a$$ takes values in
$$[-1,+1]$$ and $$y$$ takes only two values $$\{-1,+1\}$$. Then
Condition 1 is clearly satisfied. The set
$$\mathcal{B}_{1/2}$$ consists of all the functions $$g \colon \{-1,+1\} \to
[e^{-2},1]$$ such that
$$
\begin{equation}\label{eqn:newparam}
g(y) = \int e^{-\frac{1}{2} (y-a)^2} \pi(\mathrm{d} a) \qquad \text{for $y
\in \{-1,+1\}$}
\end{equation}
$$
for some distribution $$\pi$$ on $$\mathcal{A}$$. So to verify
Condition 2, we need to check that for any $$g \in
\mathcal{B}_{1/2}$$ there exists a prediction $$a_g \in \mathcal{A}$$ that
satisfies
$$
\begin{equation}\label{eqn:squaredlosstosolve}
g(y) + c(g) = e^{-\frac{1}{2}(y-a_g)^2} \qquad \text{for
$y \in \{-1,+1\}$.}
\end{equation}
$$
Solving this we find that $$a_g$$ indeed exists and equals
$$
\begin{equation}\label{eqn:backtoorig}
a_g = f^{-1}\big(g(1)-g(-1)\big),
\end{equation}
$$
where $$f^{-1}$$ is the inverse of $$f(\alpha) =
e^{-\frac{1}{2}(1-\alpha)^2} - e^{-\frac{1}{2}(\alpha+1)^2}$$ (see
the figure below). The existence of $$a_g$$ for all $$g$$ implies
that Condition 2 is satisfied, and by
Theorem 1 we have found a parameterization
in which the squared loss is $$1/2$$-exp-concave, provided that $$y$$ only
takes the values $$\{-1,+1\}$$.
<img width="500" height="337" src="/assets/blog/2012/from-exp-concavity-to-mixability/ImageSquaredLoss.png" alt="Plot of function f" />
So what happens if we allow $$y$$ to vary over the whole range $$[-1,+1]$$?
In this case I believe that no choice of $$a_g$$ will satisfy
\eqref{eqn:squaredlosstosolve} for all $$y$$, and consequently
Condition 2 does not hold. <strong>Update:</strong> However, it
turns out that any parametrization that is $$\eta$$-exp-concave for $$y \in
\{-1,+1\}$$ is also $$\eta$$-exp-concave for the whole range $$y \in
[-1,+1]$$. This is a special property, proved by Haussler, Kivinen and
Warmuth {% cite HausslerKivinenWarmuth1998 --label lemma -l Lemma 4.1 %},
{% cite Vovk2001 --label lemma -l Lemma 3 %}, that only holds for certain loss functions,
including the squared loss. Thus we have found a parameterization of the
squared loss with $$y \in [-1,+1]$$ in which it is $$1/2$$-exp-concave
(instead of only $$1/8$$-exp-concave like in the standard
parameterization): parameterize by the functions $$g$$ defined in
\eqref{eqn:newparam}, and map them to original parameters via the
mapping $$a_g$$ defined in \eqref{eqn:backtoorig}.
## Discussion
We have seen that exp-concavity trivially implies mixability.
Conversely, mixability also implies exp-concavity roughly when the
dimensionality of the set of predictions $$\dim(\mathcal{A})$$ equals the
number of outcomes $$|\mathcal{Y}|$$ minus one. In general, however, it
remains unknown whether any $$\eta$$-mixable loss can be reparameterized
to make it $$\eta$$-exp-concave with the same $$\eta$$.
As exp-concavity is a stronger requirement than mixability and
introduces these complicated reparameterization problems, one might ask:
why bother with it at all? One answer to this is that taking $$a_\pi$$ to
be the mean reduces the requirement \eqref{eqn:mixable} to ordinary
concavity, which has a nice geometrical interpretation. Nevertheless,
the extra flexibility offered by mixability can make it easier to
satisfy (for example, for the squared loss), so in general mixability
would appear to be the most convenient of the two properties.
### Afterthought
It seems the proof of Theorem 1 would still
work if we replaced $$\mathbb{1}$$ by any other positive function. I wonder
whether this extra flexibility might make Condition 2
easier to satisfy.
### Acknowledgements
<strong>Update:</strong> I would like to thank Sébastien Gerchinovitz for
pointing me to Lemma 4.1 of Haussler, Kivinen and Warmuth {%
cite HausslerKivinenWarmuth1998 %}.
## References
{% bibliography --cited %}Tim van ErvenIn sequential prediction (online learning) with expert advice the goal is to predict a sequence of outcomes almost as well as the best advisor from a pool of experts. The quality of predictions is measured by a loss function, which is determined by the application one has in mind. For most loss functions, the best performance that can be guaranteed is to be within \(O(\sqrt{T})\) of the best expert on \(T\) outcomes, but for some special loss functions \(O(1)\) overhead is possible. In the 1990’s, people familiar with the work of Vovk called these special loss functions mixable losses, but nowadays the notion of mixability appears to be mostly forgotten, and the geometric concept of exp-concavity has taken its place. This raises the question of how the two are related, which strangely does not appear to be answered in very much detail in the literature. As I have been studying mixability quite a bit in my recent work, I was wondering about this, so here are some thoughts. Update: In particular, I will construct a parameterization of the squared loss in which it is \(1/2\)-exp-concave instead of only \(1/8\)-exp-concave like in its usual parameterization.Large deviations: Cramér vs Sanov2012-08-03T00:00:00+02:002012-08-03T00:00:00+02:00https://www.timvanerven.nl/blog/large-deviations-cramer-vs-sanovI have recently been reading up on two classical results from large deviation theory: the *Cramér-Chernoff theorem* and *Sanov's theorem*. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent.
It turns out that the two are more or less equivalent, but while Sanov's theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain...
<!--more-->
## The Cramér-Chernoff Theorem
Let $$X, X_1, ..., X_n$$ be independent, identically distributed random variables, and
$$
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator*{\argmin}{arg\,min}
$$
let $$\mu(\lambda) = \log \E[e^{\lambda X}]$$ be their cumulant generating function. Then many standard concentration inequalities (Hoeffding, Bernstein, Bennett)
can be derived {% cite CesaBianchiLugosi2006 --label appendix --locator Appendix A.1 %} from a single basic result:
$$\Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) \leq e^{-n\{\lambda a - \mu(\lambda)\}}\qquad(\lambda > 0).$$
This result can easily be proved using Markov's inequality and is non-trivial as long as $$a > \E[X]$$. Its ubiquity is explained by the *Cramér-Chernoff theorem* {% cite VanDerVaart1998 %}, which states that it asymptotically achieves the best possible exponent if we optimize our choice of $$\lambda$$:
$$\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) = \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.$$
## Sanov's Theorem
The *empirical distribution* $$P_n$$ of $$X_1, \ldots, X_n$$ gives probability $$P_n(A) = \frac{1}{n}\sum_{i=1}^n {\bf 1}_{\{X_i \in A\}}$$ to any event $$A$$, which is the fraction of variables taking their value in $$A$$. If the distribution of the variables is $$P^*$$, then asymptotically we would expect $$P_n$$ to be close to $$P^*$$. So then if $$\mathcal{P}$$ is a set of distributions that is far away from $$P^*$$ in some sense, the probability that $$P_n \in \mathcal{P}$$ should be small. This intuition is quantified by *Sanov's theorem* {% cite Csiszar1984 %}, {% cite CoverThomas2006 %}.
To avoid measure-theoretic complications, let us assume that the variables $$X_1, \ldots, X_n$$ are discrete, with a finite number of possible values. Suppose $$\mathcal{P}$$ is a *convex* set of distributions and assume that the *information projection*
$$Q^* = \argmin_{P \in \mathcal{P}}\ D(P\|P^*)$$
of $$P^*$$ on $$\mathcal{P}$$ exists, where $$D(P\|P^*) = \sum_x P(x) \log \frac{P(x)}{P^*(x)}$$ is the *Kullback-Leibler divergence*. Then
$$D(Q^*\|P^*)$$ provides an information-theoretic measure of the
"distance" of $$\mathcal{P}$$ from $$P^*$$, and indeed {% cite Csiszar1984 %}
$$\Pr\Big(P_n \in \mathcal{P}\Big) \leq e^{-nD(Q^*\|P^*)}.$$
Moreover, *Sanov's theorem* tells us that this bound is asymptotically tight in the exponent:
$$\lim_{n \to \infty} -\frac{1}{n}\log \Pr\Big(P_n \in \mathcal{P}\Big) = D(Q^*\|P^*).$$
Csiszár {% cite Csiszar1984 -A -l 790 %} has an elegant information-theoretic proof of the upper bound. He also works out sufficient measure-theoretic conditions for the theorem to hold in continuous spaces, which are quite clean, but still require considerable care to verify.
One may extend Sanov's theorem to non-convex sets $$\mathcal{P}$$ by a union bound argument. For example, Cover and Thomas {% cite CoverThomas2006 -A %} take a union bound over all possible values for $$P_n$$, which they call types. By discretization arguments one may further extend the theorem to infinite spaces {% cite DenHollander2000 %}, but then things get a bit too asymptotic for my taste.
## From Sanov to Cramér-Chernoff
It turns out that the Cramér-Chernoff theorem can be obtained from Sanov's theorem. (This is called *contraction*.) The trick is to observe that
$$\E_{X \sim P_n}[X] = \frac{1}{n} \sum_{i=1}^n X_i,$$
so if we define the convex set $$\mathcal{P} = \{P \mid \E_{X \sim P}[X] \geq a\}$$, then
$$\Pr\Big(P_n \in \mathcal{P}\Big) = \Pr\Big(\frac{1}{n} \sum_{i=1}^n X_i \geq a\Big).$$
It remains to evaluate $$D(Q^* \| P^*)$$ in this case, which can be done as follows:
1. *Introduce a Lagrange multiplier* to obtain:
$$\min_{P \in \mathcal{P}} D(P\|P^*) = \min_{\{P \mid \E_{X \sim P}[X] \geq a\}} D(P\|P^*) = \min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.$$
2. *Use Sion's minimax theorem* to swap the order of the min and the sup:
$$\min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} = \sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.$$
3. Recognize the following *convex duality for Kullback-Leibler divergence*:
$$\sup_P\ \big(\E_{X \sim P}[\lambda X] - D(P||P^*\big) = \mu(\lambda),$$
where $$\mu(\lambda) = \log \E_{P^*}[e^{\lambda X}]$$ is the cumulant generating function defined above. We get:
$$\begin{split}\sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} &= \sup_{\lambda > 0}\Big\{\lambda a -\sup_P\Big(\E_P[\lambda X]- D(P\|P^*)\Big)\Big\}\\&= \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\end{split}$$
Chaining everything together we exactly recover the Cramér-Chernoff theorem, and we see that the upper bounds have exactly the same constants.
## From Cramér-Chernoff to Sanov
One may also view things the other way around. The Cramér-Chernoff theorem bounds the probability that the value of the empirical mean $$\frac{1}{n} \sum_{i=1}^n X_i$$ lies in the set $$A = [a,\infty)$$. As discussed by {% cite VanDerVaart1998 -l 208-210 %}, both the notion of empirical mean and the set $$A$$ can be generalized. In particular, one may regard the empirical distribution $$P_n$$ as the mean of $$n$$ point-masses (i.e. Dirac measures) on the points $$X_1, \ldots, X_n$$. Van der Vaart then presents Sanov's theorem as just one instance of such generalized Cramér-Chernoff theorems.
## Conclusion
We have seen the close similarities between the Cramér-Chernoff theorem and Sanov's theorem. For me Sanov's theorem seems easier to interpret, but in continuous spaces one has to deal with the more complicated measure-theoretic conditions of Csiszár {% cite Csiszar1984 -A %}. For technical reasons it may therefore be preferable to use the Cramér-Chernoff result.
## Last Words
It turns out that the upper bound in the Cramér-Chernoff theorem does
leave some slack in the order of $$1/\sqrt{n}$$, which is negligible
compared to the term in the exponent. <!--
See the <a href="http://courses.engr.illinois.edu/ece561/spring08/large-deviations.pdf">lecture notes</a> by <a href="http://vvv.ece.illinois.edu/">Venugopal Veeravalli</a> for details. -->
## References
{% bibliography --cited %}Tim van ErvenI have recently been reading up on two classical results from large deviation theory: the Cramér-Chernoff theorem and Sanov’s theorem. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent. It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…