Thursday, April 02, 2009

SVD hierarchies disappointing...

I had high expectations for putting part of the problem into an SVD hierarchy: group similar users together into what I called genes, run SVD over that coarser structure to home in on the right ratings sooner, and then follow up with a more detailed SVD run that would bring my RMSE under 0.9 or so.

Images make things a lot easier to discuss. The image here shows genes; think of them as groups of similar users. The ratings of each user determine which gene they fall into, based on their silhouette and similarity (a standard RMSE function). There are likely other methods of doing this.
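As a rough sketch only: assuming each user is stored as a sparse dict of movie ratings and each gene is summarized by a centroid of average ratings (rating_rmse, assign_gene and gene_centroids are illustrative names, not my actual code), the assignment could look something like this:

```
import numpy as np

def rating_rmse(user_a, user_b):
    """RMSE between two users over the movies they both rated.
    Each user is a dict mapping movie_id -> rating."""
    common = set(user_a) & set(user_b)
    if not common:
        return float("inf")  # no overlap: treat as maximally dissimilar
    diffs = [user_a[m] - user_b[m] for m in common]
    return float(np.sqrt(np.mean(np.square(diffs))))

def assign_gene(user, gene_centroids):
    """Pick the gene whose centroid ratings are closest (lowest RMSE) to this user."""
    return min(gene_centroids, key=lambda g: rating_rmse(user, gene_centroids[g]))
```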

Now, the trick is that each gene is incomplete, just like the full matrix. But because its users are more similar, you can take all the users within the gene and interpolate the missing entries of its structure with an external prediction function, so that the gene becomes complete and better able to classify other users. At least, that was the general idea. In practice it doesn't lead to impressive results.
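A sketch of that interpolation step, again with made-up names (complete_gene, predict) standing in for whatever external prediction function gets plugged in:

```
def complete_gene(gene_users, all_movies, predict):
    """Build a dense movie profile for one gene.

    gene_users: list of sparse rating dicts for the users in this gene.
    all_movies: iterable of every movie id.
    predict(gene_users, movie): external prediction function for movies
        nobody in the gene has rated (any baseline or blend could stand in).
    """
    profile = {}
    for movie in all_movies:
        ratings = [u[movie] for u in gene_users if movie in u]
        if ratings:
            profile[movie] = sum(ratings) / len(ratings)   # observed average
        else:
            profile[movie] = predict(gene_users, movie)    # interpolated gap
    return profile
```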

The problem lies in the density of the information. SVD doesn't need compression or approximation; it likes a lot of information. I reckon SVD works a lot better once you have a more or less square matrix. 17,700 by 480,000 is relatively square from an SVD perspective. But the matrix above isn't 480,000 wide any more; it becomes 20 by 17,700, and therein lies the rub. The information is so condensed across the users that it is not very good for calculating the individual SVD factors.

It did start me thinking about problems of asymmetry. With 20 columns versus 17,700 rows, the columns gain far too much credit on each learning run (assuming the lrate and K are equal for both sides). That is why I think square matrices are better suited to SVD. So I also tested some unequal numbers, just to see what would happen. I wanted the movie factors to gain more information than the gene factors, because across the 100,000,000 ratings I'd process, each movie factor would get hit about 5,000 times versus 5,000,000 times for each gene. That's a factor of 1,000. So I tried setting the learning rate and regularization factor accordingly, letting movies learn 1,000 times faster than genes. Still unstable. Possibly both sides need a very slow learning rate to be successful.
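To make the asymmetry concrete, here is a rough sketch of the per-rating update with separate learning rates and regularization per side, assuming a plain factor model where the prediction is the dot product of a gene vector and a movie vector. The specific values are placeholders, not the ones I actually ran:

```
import numpy as np

# Movies are hit rarely, so they may learn faster; genes are hit constantly,
# so they learn slowly. Values below are placeholders for illustration.
LRATE_MOVIE, K_MOVIE = 0.001, 0.008
LRATE_GENE,  K_GENE  = 0.000001, 0.000008

def train_one(gene_factors, movie_factors, g, m, rating):
    """One SGD step on a single (gene, movie, rating) triple."""
    err = rating - np.dot(gene_factors[g], movie_factors[m])
    gf, mf = gene_factors[g].copy(), movie_factors[m]
    gene_factors[g]  += LRATE_GENE  * (err * mf - K_GENE  * gf)
    movie_factors[m] += LRATE_MOVIE * (err * gf - K_MOVIE * mf)
    return err
```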

Most of these numbers are determined empirically through experimentation and have been suggested by others as the most successful. Trying slightly different numbers for movies/users doesn't really help much; the overall result ends up the same or worse, so it's best to stick with what you have. Of course, the regularization factor K is related to the choice of learning rate, and I think also to how large the factors get. In the previous examples, I noticed that the regularization factor is about 8 times the learning rate for the SVD factors, but about 50 times the learning rate for the biases.
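For illustration only, with placeholder base learning rates rather than my real settings, those ratios look like:

```
# K is roughly 8x the learning rate for the factors, roughly 50x for the biases.
LRATE_FACTOR = 0.001                 # placeholder base learning rate
K_FACTOR     = 8 * LRATE_FACTOR      # -> 0.008
LRATE_BIAS   = 0.001                 # placeholder bias learning rate
K_BIAS       = 50 * LRATE_BIAS       # -> 0.05
```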

So it's more or less back to the drawing board. Basically, I've noticed that SVD works much better if you start every prediction from a flat baseline rating, for example 3.6033 or 3.6043, and work from there. If you initialize to the movie average instead, it doesn't work as well. That makes sense in a way, because an absolute base to work from creates a stable point of reference for the entire matrix.
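A sketch of what I mean by the two baselines, reusing the factor model from above (the function names are just for illustration):

```
import numpy as np

GLOBAL_BASE = 3.6033  # one of the flat baselines mentioned above

def predict_flat(gene_factors, movie_factors, g, m):
    # Every prediction starts from the same absolute reference point.
    return GLOBAL_BASE + np.dot(gene_factors[g], movie_factors[m])

def predict_movie_avg(movie_avg, gene_factors, movie_factors, g, m):
    # The alternative that worked less well: start from each movie's own average.
    return movie_avg[m] + np.dot(gene_factors[g], movie_factors[m])
```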

The best thing to do now is generate my new results, open up my error analyzer and find out what else to do :). It is quite possible that there simply isn't a better single or combined algorithm, other than blending 5,000 results together just to get higher on the leaderboard :).
