Why FTRL Is Better than Online Mirror Descent

Posted on

Many online learning algorithms come in two flavors: you can either choose the online mirror descent (OMD) version or the follow-the-regularized-leader (FTRL) = dual-averaging version. In practice people usually go for OMD, which makes sense from a theoretical point of view because it can easily adapt to the size of the gradients, i.e. it is Lipschitz-adaptive. This comes with a big downside, however, because OMD may perform (very!) poorly when your intermediate iterates happen to wander away from the optimal parameters for a while, as may easily happen with non-stationary data. FTRL, on the other hand, is much more robust to non-stationarity, but it is not Lipschitz-adaptive by default according to its standard analysis. What should we do? It turns out there are actually two easy fixes to get the best of both worlds, which both build off of FTRL: the first fix is to use a refinement of the standard analysis due to Orabona and Pál to show that FTRL can be made Lipschitz-adaptive after all; and the second fix is to combine FTRL with a technique that we have started calling Cutkosky clipping in my group, after its inventor Ashok Cutkosky. Read on for an overview of the two approaches, which are both worth having in your theoretical toolbox.

Online Learning: How to Filter Out Outliers

Posted on

For the latest version of the MetaGrad algorithm, we did some pretty extensive experiments, and this really drove home the point with me that online learning algorithms only work well in practice if they can automatically adapt to the size of the gradients, as opposed to fixing an upper bound \(G\) on the maximum gradient length in advance. This probably also explains the success of AdaGrad, AdaHedge, etc. But if you do not enforce a bound on the gradient lengths, then you make yourself vulnerable to outliers: a single extreme data point (e.g. with a large gradient) can completely break the online learner. How do we defend against such outliers? In this post, I describe a simple (but optimal) filtering strategy that can be applied to any existing online learning algorithm. This is based on our recent COLT paper with Sarah Sachs, Wouter Koolen and Wojciech Kotłowski. There are also the COLT recorded videos if you prefer.

From Exp-concavity to Mixability

Posted on

In sequential prediction (online learning) with expert advice the goal is to predict a sequence of outcomes almost as well as the best advisor from a pool of experts. The quality of predictions is measured by a loss function, which is determined by the application one has in mind. For most loss functions, the best performance that can be guaranteed is to be within \(O(\sqrt{T})\) of the best expert on \(T\) outcomes, but for some special loss functions \(O(1)\) overhead is possible. In the 1990’s, people familiar with the work of Vovk called these special loss functions mixable losses, but nowadays the notion of mixability appears to be mostly forgotten, and the geometric concept of exp-concavity has taken its place. This raises the question of how the two are related, which strangely does not appear to be answered in very much detail in the literature. As I have been studying mixability quite a bit in my recent work, I was wondering about this, so here are some thoughts. Update: In particular, I will construct a parameterization of the squared loss in which it is \(1/2\)-exp-concave instead of only \(1/8\)-exp-concave like in its usual parameterization.