Wednesday, March 11, 2009

Improving the position on the leaderboard (up to 884 from 1138)

I've been busy with some algorithm refinements after reading papers on the Netflix Prize. I've made some very gradual progress, but tomorrow should bring some very good improvements after taking out a serious bug :).

It's quite a difficult problem to understand, but with more runs and a better perception of how the runs and the data behave, you get a feel for what to do. My advice to new starters is not to tie yourself down too much to your algorithms. Fiddling with parameters is not going to significantly improve your results; find ways to understand the data instead. The crux of the problem is that your algorithm is using a very sparse matrix as its basis for predicting values in the cells where no value is present. Since the matrix is only 1.5% or so full, that's not a lot of data to start from.
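
For a sense of scale, here's the density arithmetic as a quick sketch, using the published training-set dimensions (roughly 100.5 million ratings over 480,189 users and 17,770 movies):

```python
# Density of the Netflix ratings matrix: known cells / total cells.
ratings_count = 100_480_507           # ratings in the training set
users, movies = 480_189, 17_770
density = ratings_count / (users * movies)
print(f"{density:.2%}")               # ~1.18%: nearly 99% of cells are empty
```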

I'm measuring performance mostly against the probe set, not against the test set. The training set is just used to train the features.

So, the difficulty of applying SVD to the Netflix Prize is that the matrix is extremely sparse, which generates loads of inaccuracies, and also that with so few values available, the elasticity between values doesn't develop. By elasticity, I mean that you can perceive the original matrix row by row, as similarities between movies, but also column by column, looking at users. In a full matrix, the SVD algorithm develops a much closer prediction ability, because the values allow it to approximate whatever function it is that the matrix embodies. So, a movie that has high ratings from many users could at some point drop significantly for some other "type" of user. The problem is that there aren't many measuring points to find that out.
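
For anyone who hasn't implemented this: the "SVD" in question is the incremental matrix-factorization approach popularized by Simon Funk, trained by gradient descent over the known cells only. A minimal sketch, with illustrative parameter values rather than my actual ones:

```python
import numpy as np

def predict(U, M, u, m):
    # Dot product of the user's and the movie's feature vectors,
    # clipped to the valid rating range.
    return np.clip(U[u] @ M[m], 1.0, 5.0)

def train_feature(ratings, U, M, f, lrate=0.001, K=0.015, epochs=120):
    """Train one latent feature f by gradient descent on known ratings only.

    ratings: list of (user, movie, rating) triples (the sparse matrix)
    U, M:    user-by-feature and movie-by-feature arrays
    """
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(U, M, u, m)
            uf, mf = U[u, f], M[m, f]
            U[u, f] += lrate * (err * mf - K * uf)   # K is the regularizer
            M[m, f] += lrate * (err * uf - K * mf)
```

Features are trained one after the other, each one fitting what the previous features left unexplained.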

The harder predictions are those that fall far from the average that the algorithm would generally predict. But since such observed values are rare, the features never really come to embody that strong divergence.

I actually found that the key to solving everything without the use of blending is to do more work up-front, to better approximate the actual function, so that SVD can fill in the rest. The earlier the rmse descends toward the real values, the better the overall result.
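
One concrete form of up-front work (sketched here from the standard baseline trick discussed on the Netflix Prize forums, as an assumption about what helps rather than a description of my exact code) is to subtract a global mean plus per-movie and per-user offsets first, so SVD only has to fit the residuals:

```python
from collections import defaultdict

def baselines(ratings, global_mean):
    """Per-movie and per-user average offsets from the global mean."""
    m_sum, m_cnt = defaultdict(float), defaultdict(int)
    for u, m, r in ratings:
        m_sum[m] += r - global_mean
        m_cnt[m] += 1
    movie_bias = {m: m_sum[m] / m_cnt[m] for m in m_sum}

    u_sum, u_cnt = defaultdict(float), defaultdict(int)
    for u, m, r in ratings:
        u_sum[u] += r - global_mean - movie_bias[m]
        u_cnt[u] += 1
    user_bias = {u: u_sum[u] / u_cnt[u] for u in u_sum}
    return movie_bias, user_bias

# SVD then trains on r - (global_mean + movie_bias[m] + user_bias[u]).
```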

I've also seen posts from other people who seem to be able to let their SVD algorithms run for 1,000 features. In my implementation, I clearly see a point where it starts to diverge and the rmse increases. In my original implementation that was around 32 features; right now it's about 42 features. I've decreased the learning rate and improved the parameters, and I'll surely post those once I see significant improvements.
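
This is why I add features one at a time and watch the probe rmse. A sketch of that stopping rule, reusing the predict and train_feature sketches above (probe_rmse and the loop are illustrative, not my actual code):

```python
def probe_rmse(probe, U, M):
    se = sum((r - predict(U, M, u, m)) ** 2 for u, m, r in probe)
    return (se / len(probe)) ** 0.5

def train_until_divergence(ratings, probe, U, M, max_features=64):
    """Add features one by one; stop when the probe rmse turns upward."""
    best = float("inf")
    for f in range(max_features):
        train_feature(ratings, U, M, f)
        rmse = probe_rmse(probe, U, M)   # held-out probe, not training data
        if rmse >= best:                 # the divergence point
            return f, best
        best = rmse
    return max_features, best
```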

Overall, if you're not significantly changing the heart of your algorithm and do not use ridiculous parameters, this point where divergence occurs is more or less the same. The improvement in rmse from one feature to the next also remains more or less equal. Thus, if you are seeing improvements of 0.176155, 0.007401, 0.004668 per feature, and a later run starts with a lower rmse because you improved parameters or did more work up-front, the descent from there is more or less the same, so the final rmse ends up lower by roughly the improvement at the first feature.

Using wilder parameters (a much lower learning rate or stronger regularization) can make things look really good from the start, but divergence or instability is much more likely to occur after 15-16 features or so. Thus, I'm not tweaking those parameters that much anymore, as the strength of SVD isn't so much in the parameters; it's in how things are learned and what you do before/after. The differences between good values of lrate and K are only marginal.

The learning rate only affects how fast the algorithm learns, though slightly lower rates allow the rmse to descend deeper, because they realize a better approximation of the function. I have found, though, that changing K (the regularization factor) throughout the SVD run has some positive results. Apparently the data easily overfits, and increasing the value of K gave me good results up to feature 14, with instability afterwards, but that was probably due to hitting the boundaries of its effectiveness.
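
A sketch of what a K schedule could look like; the schedule values here are placeholders, not my tuned ones:

```python
def k_for_feature(f, k_start=0.015, k_step=0.001, k_max=0.03):
    # Regularize later features more aggressively: they fit ever-smaller
    # residuals and overfit more easily.
    return min(k_start + k_step * f, k_max)

# Inside the per-feature loop:
#   train_feature(ratings, U, M, f, K=k_for_feature(f))
```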

I've calculated that, if final performance really is linear with respect to the starting rmse, I'll need to start with an rmse of around 0.8957 in order to hit the 10% mark. So, by looking at just the first feature, and assuming this relationship stays linear, I can quickly see the effect of my changes to the algorithm.
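
The arithmetic behind that, roughly: Cinematch scores 0.9514 on the quiz set, so the 10% mark is 0.9514 × 0.9 ≈ 0.8563. The total descent below is back-solved from my 0.8957 figure, standing in for whatever a full run gains after the first feature:

```python
cinematch = 0.9514
target = cinematch * 0.90        # 10% improvement: ~0.8563
total_descent = 0.0394           # placeholder: rmse gained after feature 1
required_start = target + total_descent
print(round(required_start, 4))  # ~0.8957, if the descent stays constant
```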

(EDIT: See results. The rmse was still dropping, so I'll reproduce the run with some other small tweaks and one larger change.) More tomorrow!
