Wednesday, May 06, 2009

Tikhonov Regularization

This post is about Tikhonov Regularization, which is used for statistical and predictive method. For systems where the predictions are nicely along the real values, like so:

Things are swell, because the predictions are on either side of the line. Basically, you have a good chance of finding a very good solution for this, as it looks like a well-posed problem. The (re)estimations are very regular positive and negative, yielding very good values for a linear regression of your found values, such that you can predict new values which you haven't initially trained for (for which you know the actual value).

Well, the netflix prize is facing an ill-posed problem. This basically means that there are a number of different solutions possible, or there are cases where the differences become so small, that it generates numerical instability when calculating that solution. I've basically seen this happen in some of my implementations when doing the blending.

So, what's done is that a regularization factor is introduced, which is giving penalties to larger factors, thereby conditioning the problem better (reducing the number of solutions). See an example what part of a graph might look like:

Okay, so in this case, we may have a prediction function which is pretty complicated and for a large number of values, which are actually connected to other dimensions, it's looking like this for part of the range and domain. If you'd grab this part of the graph and then try to refit this with different approaches into a better estimation overall, then chances are very high you'd have numerical instability here. Notice how many black points are below the line. The red line (the real values) are way above that. Even though the distance between the point and the line aren't necessary disturbing, when trying to do a regression on this sub-space of the graph, you'd end up with some very bad factors. It may be better to just follow the red line along the bottom a bit, because if you'd regress towards different factors, you'd get the same problem elsewhere along the road.

Now, for a little bit of philosophy, read Occam's razor. This basically states: "The simplest explanation for a phenomenon is most likely the correct explanation".
Other methods for inferring evolutionary relationships use parsimony in a more traditional way. Likelihood methods for phylogeny use parsimony as they do for all likelihood tests, with hypotheses requiring few differing parameters (i.e., numbers of different rates of character change or different frequencies of character state transitions) being treated as null hypotheses relative to hypotheses requiring many differing parameters. Thus, complex hypotheses must predict data much better than do simple hypotheses before researchers reject the simple hypotheses. Recent advances employ information theory, a close cousin of likelihood, which uses Occam's Razor in the same way.
Thus, it's better to use models with fewer parameters than it is to use models that are using many, many different parameters. Find those models which describe the problem well, then merge them together.

Back to Tikhonov... If you're also a contender in the netflix prize, check out the following source, which is heavily chopped down in complexity, improved in readability and derived from the source of linalg, which was developed by gravity. This source is not likely usable without the use of their linalg source codes, since it uses PseudoInverse.cpp for example:

Link to main.cpp

The regfactor and regfactorbias also exist in linalg and those are key towards bringing down the rmse. Remember that we potentially have numerical instability, which is why we want to introduce regularization. The regularization conditions the problem, after which better solutions are found. Better solutions provide better fits, reducing error. You'd prefer these regularization terms as close to 0.0f as possible, because when this reg term goes towards 0.00f, you'll go towards the unregularized solution of least squares regression. It'd be great to get there, but the problem itself is ill-posed, so we need some regularization at least.

I did rip out the fancy parts like cross-validation (used to find optimal values) and rmse reporting, such that you can follow the code better. In the pseudo-inverse, it's basically executing the linear least square solver.

The returned array is an array of weights, which are used to multiply with a bias of 1.0f, another weight for algo1, another for algo2... If you like, you could easily add other algorithms and this thing will give you reasonable weights in return.


Anonymous said...

So this begs the obvious question - for the Netflix data how much improvement is there over a simple linear combination of predictions?

Gerard Toonstra said...

I can't possibly answer that question, because it depends on how your predictions are arranged in the total data set. I assume however that for many prediction sets, they are quite well distributed across the actual ratings, if you consider the entire prediction range. Thus, the problem isn't necessarily ill-posed. In that case, the overall rmse is probably very equal. That is why many people state that there's no difference whatsoever.

I have a set here which without any regularization gets 0.890916. With regularization it gets 0.890897. So it's minimal, but only because the number of samples is so high.

If interested, you can play around with the numbers in the main.cpp. You can see the results for ratings that are generally above or below the actual ratings and their influence on the weights (turn regularization off though).

Tikhonov is certainly no silver bullet. It's very likely that it's mostly effective for situations where there are a small number of samples. 140,000 or 1,400,000 doesn't probably qualify as small, plus that their spread is quite regular.

The best thing to conclude is that there are bad prediction sets which don't contribute to your overall results and there are good prediction sets which do help. Tikhonov may help you to still nicely blend the not-so-good ones, but it definitely isn't a silver bullet that hammers things into place.

Having a number of prediction sets which blend very nicely together with regular normalization is the best way forward, probably.

Gerard Toonstra said...

The following shows that with different regularization settings, you'll get very different results, yet the overall RMSE is equal. The bias 0.0000 in the middle probably yields the solution for a generalized linreg.

0.890898 bias=-0.030049 Ws=[0.295879,0.714400]
0.890897 bias=-0.032216 Ws=[0.294100,0.716759]
0.890916 bias=0.000000 Ws=[0.285848,0.716473]
0.890897 bias=-0.032768 Ws=[0.293651,0.717356]
0.890897 bias=-1.619981 Ws=[0.293563,0.717346]