Online Learning: How to Filter Out Outliers

Posted on

10 minute read

For the latest version of the MetaGrad algorithm [1], we did some pretty extensive experiments, and this really drove home the point with me that online learning algorithms only work well in practice if they can automatically adapt to the size of the gradients, as opposed to fixing an upper bound \(G\) on the maximum gradient length in advance. This probably also explains the success of AdaGrad [2], AdaHedge [3], etc.

But if you do not enforce a bound on the gradient lengths, then you make yourself vulnerable to outliers: a single extreme data point (e.g. with a large gradient) can completely break the online learner. How do we defend against such outliers? In this post, I describe a simple (but optimal) filtering strategy that can be applied to any existing online learning algorithm. This is based on our recent COLT paper [4] with Sarah Sachs, Wouter Koolen and Wojciech Kotłowski. There are also the COLT recorded videos if you prefer.

this is a placeholder image
A single outlier can completely change the decision boundary learned by (offline) logistic regression. The same problem occurs in online learning.

Motivation: Reasons for Outliers

Before getting into the technical side of things, it is helpful to consider several reasons why outliers might occur:

  1. Naturally Heavy-tailed Data: The distribution of your data might be heavy-tailed, meaning that extreme data points occur naturally with small probability.
  2. Malicious Users: If each data-point corresponds to a different user, then a small number of users may be trying to poison the data stream by deliberately generating adversarial inputs. Think, e.g., of spammers generating malformatted e-mails.
  3. Sensor Glitches: With the increasing prominence of cheap sensors, measurement errors can occasionally generate extreme inputs.

To deal with all such cases simultaneously, we want to avoid making any assumptions about the nature of the outliers. This motivates the game-theoretic approach described below. \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\argmin}{arg\,min} \DeclareMathOperator{\diameter}{diameter} \DeclareMathOperator{\risk}{Risk} \newcommand{\u}{\boldsymbol u} \newcommand{\w}{\boldsymbol w} \newcommand{\grad}{\boldsymbol g} \newcommand{\domain}{\mathcal{W}} \newcommand{\reals}{\mathbb{R}} \newcommand{\cS}{\mathcal{S}} \newcommand{\gradlist}{\mathcal{L}}\)

The Robust Regret Setting

Let’s start by fixing the setting, which will be a variant of the standard online convex optimization framework: Each round \(t=1,\ldots,T\), the learner predicts a point \(\w_t\) from a convex domain \(\domain \subset \reals^d\) of diameter at most \(\diameter(\domain) \leq D\). The environment then selects a convex loss function \(f_t : \domain \to \reals\) and the learner incurs loss \(f_t(\w_t)\). Most online learning algorithms update their predictions efficiently, using only the gradient \(\grad_t = \nabla f_t(\w_t)\), and the standard objective is for the learner to control the regret

\[R_T(\u) = \sum_{t=1}^T \big(f_t(\w_t) - f_t(\u)\big) \qquad \textrm{(standard regret)}\]

for all \(\u \in \domain\) simultaneously.

Suppose now that only a subset of rounds \(\cS \subseteq [T] := \{1,\ldots,T\}\) consist of reliable inliers and the other rounds \([T] \setminus \cS\) are all outliers that our learning algorithm should ignore. Then the objective will be to control the robust regret:

\[R_T(\u,\cS) = \sum_{t \in \cS} \big(f_t(\w_t) - f_t(\u)\big) \qquad \textrm{(robust regret)}.\]

Note the difference, which is that we are measuring performance only on inlier rounds in \(\cS\).

Main Challenges

The robust regret setup creates the following challenges:

  1. \(\cS\) is unknown, and may in fact be chosen adversarially to maximally confuse the learner.
  2. We will assume that the maximum length \(G(\cS) = \max_{t \in \cS} \|\grad_t\|\) of the inlier gradients is reasonable, but we make no assumptions whatsoever about the (lengths of) outlier gradients \(\grad_t\) for \(t \not \in \cS\).

Since we do not assume that outliers behave reasonably in any way, it follows that bounds on the robust regret can only depend on the scale \(G(\cS)\) of the inliers, but not on the lengths of the outliers!

Approach: Top-k Filtering

So what can we do? We will take a very general approach, in which we modify any existing online learner ALG by filtering out some rounds with extreme gradients. By filtering a round, I mean that we pretend to ALG that the round did not happen, so we do not feed it any gradient and ALG’s prediction remains unchanged.

We will assume that there are at most \(k\) outliers, so

\[T - |\cS| \leq k.\]

Then the filtering strategy, which we call top-k filter, works as follows:

  • Maintain a list \(\gradlist_t\) of the \(k+1\) largest gradient lengths seen during rounds \(1,\ldots,t\).
  • Filter out round \(t\) if \(\|\grad_t\| \geq 2 \min \gradlist_t\); otherwise pass the round on to ALG.

Note the factor 2 in front of \(\min \gradlist_t\) in the filtering threshold!

Analysis: Are We Happy?

One thing to like about top-k filter is that it is a nice and simple approach, but of course true happiness can only be found if its performance is any good. In the paper, we first show the following general guarantee for top-k filtering when the losses are linear, i.e. \(f_t(\w) = \w^\top \grad_t\).

Theorem 1 (Linear Losses). Suppose losses are linear, and ALG achieves regret bound \(B_T(G)\) when running on at most \(T\) rounds with gradients of length at most \(G\). Then ALG + top-k filter guarantees that the robust regret does not exceed

\[R_T(\u,\cS) \leq B_T(2 G(\cS)) + 4 D G(\cS)(k+1) \qquad \textrm{for any $\cS: T - |\cS| \leq k$.}\]

See below for the main ideas behind the proof.

The first term, \(B_T(2 G(\cS))\), comes from feeding ALG gradients of length at most \(2 G(\cS)\), which it turns out is guaranteed by top-k filter. The second term of order \(O(G(\cS) k)\) is the price we pay for achieving robustness. Note that it scales with the size of the inlier gradients \(G(\cS)\) only, and does not depend on the outliers at all!

At first sight, it might appear restrictive that the bound only applies to linear losses, but we can easily get around this using a standard reduction from general convex losses to the linear case via the inequality \(f_t(\w_t) - f_t(\u) \leq (\w_t - \u)^\top \grad_t\). There exists a similar reduction for so-called strongly convex losses. By instantiating ALG to be the standard online gradient descent algorithm, this leads to the following guarantees on the robust regret:

Losses Minimax Standard Regret Minimax Robust Regret
General Convex Losses \(O(\sqrt{T})\) \(O(\sqrt{T} + k)\)
Strongly Convex Losses \(O(\ln(T))\) \(O(\ln(T) + k)\)

These rates turn out to be optimal, because we prove matching lower bounds both for general convex losses and for strongly convex losses. In fact, for general convex losses the lower bound construction uses independent, identically distributed (i.i.d.) losses, so even if we assume that the losses are coming from a nice fixed distribution, then our upper bound is still optimal. We see that the optimal price of robustness is an additive \(O(k)\) in quite some generality, and top-k filter achieves it.

So are we happy? Well, the pursuit of happiness is a fool’s errant. We should rather aim to live a fulfilling life, but certainly there is nothing more fulfilling than optimality!

Proof Ideas Behind Theorem 1

The main ideas behind top-k filtering and Theorem 1 are as follows:

1. Top-k filter ensures that we never pass ALG any gradients whose length exceeds \(2 G(\cS)\).

To see this, note that:

  • \(\gradlist_t\) must contain at least one inlier, because it has length \(k+1\) and there are at most \(k\) outliers.
  • Hence \(\min \gradlist_t \leq G(\cS)\), and our filtering threshold is \(2\min \gradlist_t\).

2. The overhead for filtering is \(O(k)\).

We can account for incorrectly filtered rounds \(t \in \cS\) as follows:

  • Every filtered round must have a large gradient, whose length is added to \(\gradlist_t\).
  • By the factor 2 in the filtering threshold, \(\min \gradlist_t\) therefore (at least) doubles every \(k+1\) filtered rounds.
  • Because of this exponential growth of \(\min \gradlist_t\), the last \(k+1\) incorrectly filtered rounds dominate in the bound.

Robustified Online-to-Batch Conversion for Huber ε-Contamination

There is one more result from the paper that I would like to highlight, which is that the robust regret can be used for robust learning in the Huber ε-contamination setting via a robustified variant of the usual online-to-batch conversion trick.

In the Huber ε-contamination model, our losses are of the form \(f_t(\w) = f(\w,\xi_t)\), where the underlying data \(\xi_1,\ldots,\xi_T\) are sampled i.i.d. from a mixture distribution \(P_\epsilon\):

\[P_\epsilon = (1-\epsilon) P + \epsilon Q.\]

The interpretation is that observations from the distribution of interest \(P\) are corrupted by observations from some unknown outlier distribution \(Q\) with probability \(\epsilon\). This is a batch-learning setting, in which the goal is to output a single set of parameters \(\bar \w_T\) that achieve small risk under \(P\):

\[\risk_P(\w) = \E_{\xi \sim P}[f(\w,\xi)].\]

When \(\epsilon = 0\) (i.e., there are no outliers), the standard online-to-batch conversion approach [5] allows us to average the predictions \(\w_1,\ldots,\w_T\) of any online learning algorithm according to

\[\bar \w_T = \frac{1}{T} \sum_{t=1}^T \w_t\]

with the guarantee that

\[\E_P[\risk_P(\bar \w_T) - \risk_P(\u_P)] \leq \frac{\E_P[R_T(\u_P)]}{T},\]

where \(\u_P = \argmin_{\u \in \domain} \risk_P(\u)\).

It turns out that this can be generalized to the case \(\epsilon > 0\), in which there are outliers, if we replace the standard regret by the robust regret:

\[\E_{P_\epsilon}[\risk_P(\bar \w_T) - \risk_P(\u_P)] \leq \frac{\E_{P_\epsilon}[R_T(\u_P,\cS^*)]}{(1-\epsilon)T},\]

where the inliers \(\cS^*\) are those rounds in which we receive a sample from the distribution of interest \(P\). Notice that now the expectations are under the mixture distribution \(P_\epsilon\), but the risk is measured with respect to \(P\)! If the losses or gradients of the inliers are bounded \(P\)-almost surely, then an analogous result also holds with high probability.

For instance, we show the following corollary, which achieves rates for Huber ε-contamination that are optimal under its assumptions:

Corollary. Suppose \(\|\nabla f(\w,\xi)\| \leq G\) when \(\xi \sim P\) is an inlier. Then online gradient descent with top-k filtering achieves

\[\risk_P(\bar \w_T) - \risk_P(\u_P) = O\Big(DG \epsilon + DG \sqrt{\frac{\ln(1/\delta)}{T}}\Big)\]

with \(P_\epsilon\)-probability at least \(1-\delta\), for \(k\) tuned suitably depending on \(\epsilon,T\) and \(\delta \in (0,1]\).


In this post, I have described a filtering approach that provides a way to robustify any online learning algorithm ALG by filtering out large gradients using top-k filter. This approach turns out to be optimal in the new robust regret setting, which further allows a type of robustified online-to-batch conversion for the Huber ε-contamination model.


  1. T. van Erven, W. M. Koolen, and D. van der Hoeven, “MetaGrad: Adaptation using Multiple Learning Rates in Online Learning,” Journal of Machine Learning Research, vol. 22, no. 161, pp. 1–61, 2021, [Online]. Available at:
  2. J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
  3. S. de Rooij, T. van Erven, P. D. Grünwald, and W. M. Koolen, “Follow the Leader If You Can, Hedge If You Must,” Journal of Machine Learning Research, vol. 15, pp. 1281–1316, 2014.
  4. T. van Erven, S. Sachs, W. M. Koolen, and W. Kotłowski, “Robust Online Convex Optimization in the Presence of Outliers,” in Proceedings of 34th Conference on Learning Theory, 2021, vol. 134, pp. 4174–4194, [Online]. Available at:
  5. N. Cesa-Bianchi, A. Conconi, and C. Gentile, “On the generalization ability of on-line learning algorithms,” IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 2050–2057, 2004.