# Blog

### Posts

## Why FTRL Is Better than Online Mirror Descent

## Summary

Many online learning algorithms come in two flavors: you can either choose the online mirror descent (OMD) version or the follow-the-regularized-leader (FTRL) = dual-averaging version. In practice people usually go for OMD, which makes sense from a theoretical point of view because it can easily adapt to the size of the gradients, i.e. it is Lipschitz-adaptive. This comes with a big downside, however, because OMD may perform (very!) poorly when your intermediate iterates happen to wander away from the optimal parameters for a while, as may easily happen with non-stationary data. FTRL, on the other hand, is much more robust to non-stationarity, but it is not Lipschitz-adaptive by default according to its standard analysis. What should we do? It turns out there are actually two easy fixes to get the best of both worlds, which both build off of FTRL: the first fix is to use a refinement of the standard analysis due to Orabona and Pál to show that FTRL can be made Lipschitz-adaptive after all; and the second fix is to combine FTRL with a technique that we have started calling Cutkosky clipping in my group, after its inventor Ashok Cutkosky. Read on for an overview of the two approaches, which are both worth having in your theoretical toolbox.

Continue reading## Online Learning: How to Filter Out Outliers

## Summary

For the latest version of the MetaGrad algorithm [1], we did some pretty extensive experiments, and this really drove home the point with me that online learning algorithms only work well in practice if they can automatically adapt to the size of the gradients, as opposed to fixing an upper bound \(G\) on the maximum gradient length in advance. This probably also explains the success of AdaGrad [2], AdaHedge [3], etc. But if you do not enforce a bound on the gradient lengths, then you make yourself vulnerable to outliers: a single extreme data point (e.g. with a large gradient) can completely break the online learner. How do we defend against such outliers? In this post, I describe a simple (but optimal) filtering strategy that can be applied to any existing online learning algorithm. This is based on our recent COLT paper [4] with Sarah Sachs, Wouter Koolen and Wojciech Kotłowski. There are also the COLT recorded videos if you prefer.

Continue reading## PAC-Bayes Mini-tutorial: A Continuous Union Bound

## Summary

When I first encountered PAC-Bayesian concentration inequalities they seemed to me to be rather disconnected from good old-fashioned results like Hoeffding’s and Bernstein’s inequalities. But, at least for one flavour of the PAC-Bayesian bounds, there is actually a very close relation, and the main innovation is a continuous version of the union bound, along with some ingenious applications. Here’s the gist of what’s going on, presented from a machine learning perspective.

Continue reading## From Exp-concavity to Mixability

## Summary

In sequential prediction (online learning) with expert advice the goal is to predict a sequence of outcomes almost as well as the best advisor from a pool of experts. The quality of predictions is measured by a loss function, which is determined by the application one has in mind. For most loss functions, the best performance that can be guaranteed is to be within \(O(\sqrt{T})\) of the best expert on \(T\) outcomes, but for some special loss functions \(O(1)\) overhead is possible. In the 1990’s, people familiar with the work of Vovk called these special loss functions mixable losses, but nowadays the notion of mixability appears to be mostly forgotten, and the geometric concept of exp-concavity has taken its place. This raises the question of how the two are related, which strangely does not appear to be answered in very much detail in the literature. As I have been studying mixability quite a bit in my recent work, I was wondering about this, so here are some thoughts. Update: In particular, I will construct a parameterization of the squared loss in which it is \(1/2\)-exp-concave instead of only \(1/8\)-exp-concave like in its usual parameterization.

Continue reading## Large deviations: Cramér vs Sanov

## Summary

I have recently been reading up on two classical results from large deviation theory: the Cramér-Chernoff theorem and Sanov’s theorem. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent. It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…

Continue reading