# Online Learning

This is an overview of all posts in the category Online Learning.

Posted on

Summary

## Online Learning: How to Filter Out Outliers

Posted on

Summary

For the latest version of the MetaGrad algorithm [1], we did some pretty extensive experiments, and this really drove home the point with me that online learning algorithms only work well in practice if they can automatically adapt to the size of the gradients, as opposed to fixing an upper bound $G$ on the maximum gradient length in advance. This probably also explains the success of AdaGrad [2], AdaHedge [3], etc. But if you do not enforce a bound on the gradient lengths, then you make yourself vulnerable to outliers: a single extreme data point (e.g. with a large gradient) can completely break the online learner. How do we defend against such outliers? In this post, I describe a simple (but optimal) filtering strategy that can be applied to any existing online learning algorithm. This is based on our recent COLT paper [4] with Sarah Sachs, Wouter Koolen and Wojciech Kotłowski. There are also the COLT recorded videos if you prefer.

In sequential prediction (online learning) with expert advice the goal is to predict a sequence of outcomes almost as well as the best advisor from a pool of experts. The quality of predictions is measured by a loss function, which is determined by the application one has in mind. For most loss functions, the best performance that can be guaranteed is to be within $O(\sqrt{T})$ of the best expert on $T$ outcomes, but for some special loss functions $O(1)$ overhead is possible. In the 1990’s, people familiar with the work of Vovk called these special loss functions mixable losses, but nowadays the notion of mixability appears to be mostly forgotten, and the geometric concept of exp-concavity has taken its place. This raises the question of how the two are related, which strangely does not appear to be answered in very much detail in the literature. As I have been studying mixability quite a bit in my recent work, I was wondering about this, so here are some thoughts. Update: In particular, I will construct a parameterization of the squared loss in which it is $1/2$-exp-concave instead of only $1/8$-exp-concave like in its usual parameterization.