
Wednesday, November 11, 2009

Netflix prize revisited

The Netflix Prize has been over for some months now. I ended up 215th on the final leaderboard. Considering the time I put in, that's probably about where I should have ended up; other teams worked on the prize for three years and rightfully finished in the top 40.

If you have a research interest in the Netflix Prize, you can still download the data from the UCI Machine Learning Repository. That set also includes the actual ratings that were used to score the submissions, as well as the winning submission set.

A very interesting take on the problem is PyFlix. It requires the downloaded data, but converts it into a binary database that is later mmap-ed into the process space. That's quite clever, because with mmap you can also dump an index to disk and map it back in only when you need it. You shouldn't do that inside tight loops, of course, but it goes to show that for performance computing, plain files are generally a much better approach than a database, provided you put in the work to shape the files to your needs.
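
As a rough illustration of the idea (this is not PyFlix's actual file format, just a made-up layout and file name), memory-mapping a binary ratings file from Python could look like this:

import numpy as np

# Hypothetical record layout: one (movie_id, user_id, rating) record per rating.
# This is an assumption for the example, not PyFlix's real on-disk format.
record = np.dtype([('movie', np.int32), ('user', np.int32), ('rating', np.int8)])

# memory-map the file; pages are only read from disk when they are actually touched
ratings = np.memmap('ratings.bin', dtype=record, mode='r')

print(ratings[0])                             # touching the first record faults in one page
first_movie = ratings[ratings['movie'] == 1]  # note: this scans (and pages in) the whole file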

If you look at the Basic Usage section of the PyFlix Trac page, you'll see examples of the very nice interface that was built on top of the data. Researchers often don't have this much data to prove their concepts with, and an interface like this over the Netflix data is remarkably elegant for looking into similarities and trying out ideas directly from the Python interpreter.

I have found, however, that a production implementation of SVD or any other heavy algorithm isn't truly viable in Python, because of CPU constraints and overhead in a number of areas. One of those areas is illustrated by the following gcc flags used for the binaries built on my machine:
-Wall -march=core2 -mtune=core2 -O2
These tell the compiler to tune the binaries specifically for the processor I'm running on, instead of for a generic "i386" architecture. I'm no expert on Python, so there may well be ways to optimize it so that it runs much better. The flags above generate much more efficient code: a single training cycle that takes 27 seconds with generic i386 code drops to about 7 seconds in total.

Programmers generally don't worry much about memory, and usually they don't need to (readability of code and other quality attributes deserve attention as well), but for loops that are executed a very large number of times (in the millions), the actual performance of that loop starts to matter a lot. There is a very nice article about memory written by the maintainer of glibc on Linux. The glibc library is basically the glue between the kernel and your application, and it provides a number of handy low-level utility functions that every application uses (like strlen, strstr, etc.).

One important aspect of maintaining performance is ordering your data (where functionally and technically possible) so that the processor doesn't take a cache miss to get at it; those kinds of loops then run much quicker. Another common performance hog is the order in which you traverse data, for example:
double sum = 0.0;
double my2DArray[ ROWS ][ COLS ];
When cycling through this array, you want to loop over the rows first and then over the columns: in C the elements of a row are laid out linearly in memory, one column after the other. So you'd typically iterate like this:
for ( int i = 0; i < ROWS; i++ )
    for ( int j = 0; j < COLS; j++ )
        sum += my2DArray[ i ][ j ];    /* walks consecutive memory: cache friendly */
Compare that to:
for ( int j = 0; j < COLS; j++ )
    for ( int i = 0; i < ROWS; i++ )
        sum += my2DArray[ i ][ j ];    /* each step jumps a whole row ahead: cache misses */
The second version causes cache misses all over the place, which is very bad for performance. Ideally, for computational tasks you find an algorithm that keeps the data it works on close together in the processor cache for as long as possible, where that is feasible.

The PyFlix implementation quietly reads from disk and does quite a few things in the background on every iteration of the rating loop, which severely hurts performance in the long run. The good thing is that there is a very elegant API to access the data for other purposes, and that API includes a rather fast index. It's as if a tiny, purpose-built database engine was written to access the data, which is a remarkable and impressive feat in itself!

Thursday, April 02, 2009

SVD hierarchies disappointing...

I had high expectations of putting part of the problem into an SVD hierarchy: group similar users together into what I call genes, run SVD over that, and then follow up with a more detailed SVD run that would bring my rmse under 0.9 or so, the idea being to home in on the right ratings sooner.

Images make things a lot easier to discuss. The image here shows genes; think of them as groups of similar users. The ratings of each user determine the gene they fall into, by calculating their silhouette and similarity (a standard rmse function). There are likely other ways of doing this.

Now, each gene starts out incomplete, just like the matrix. But because its users are more similar to each other, if you take all the users within the gene and interpolate the rest of its structure using an external prediction function, the gene becomes complete and better able to classify other users. At least, that was the general idea. In practice it doesn't lead to incredible results.

The problem lies in the density of the information. SVD doesn't want compression or approximation; it likes a lot of information. I also reckon SVD works a lot better on a more or less square matrix; 17,770 by 480,000 is relatively square from an SVD perspective. But the matrix above isn't 17,770 by 480,000 any more, it becomes 20 by 17,770, and therein lies the rub. The information is condensed so heavily across the users that it's not very good for calculating individual SVD factors.

It did start me thinking about problems of asymmetry. With 20 columns versus 17,770 rows, the columns gain far too much credit in each learning run (assuming the lrate and K are equal for both sides); that is why I think square matrices are better suited for SVD. So I also tested some unequal settings, just to see what would happen. I wanted the movie factors to gain more information than the gene factors, because over the 100,000,000 ratings I process, each movie factor gets hit about 5,000 times versus about 5,000,000 hits per gene: a factor of 1,000. So I set up the learning rate and regularization factor accordingly, letting movies learn 1,000 times faster than genes. Still unstable. Possibly both sides need a very slow learning rate to be successful.

Most of these numbers are determined empirically through experimentation and have been suggested by others as the most successful. Trying slightly different numbers for movies/users doesn't really help much; the overall end result is the same or worse, so it's best to stick with what you have. Of course, the regularization factor K is related to the choice of learning rate, and I think also to how large the factors get. In the earlier experiments I noticed that the regularization factor is about 8 times the learning rate for the SVD factors, but about 50 times the learning rate for the biases.
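
For reference, the kind of per-rating update I'm talking about looks roughly like this. It's a sketch of plain gradient descent with regularization for a single feature; the variable names and the lrate / K values are only illustrative, not my production settings:

# one gradient-descent step for a single rating, single feature
lrate_movie, K_movie = 0.001, 0.015   # illustrative values
lrate_gene,  K_gene  = 0.001, 0.015   # could be set 1,000x apart, as discussed above

def train_one(rating, movie_f, gene_f, baseline):
    err = rating - (baseline + movie_f * gene_f)
    # update each side from the *old* value of the other side,
    # held back by its own regularization factor K
    new_movie_f = movie_f + lrate_movie * (err * gene_f - K_movie * movie_f)
    new_gene_f  = gene_f  + lrate_gene  * (err * movie_f - K_gene  * gene_f)
    return new_movie_f, new_gene_f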

So it's more or less back to the drawing board. Basically, I've noticed that SVD works much better if you start from a flat baseline value, for example 3.6033 or 3.6043, and work from there. Initializing to the movie average doesn't work as well. That makes sense in a way, because an absolute base to work from creates a stable point of reference for the entire matrix.
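
In code the difference between the two starting points is tiny (array names made up, 3.6033 as in the text):

import numpy as np
NUM_MOVIES = 17770

# flat starting value for every movie (works better for me)
baseline = np.full(NUM_MOVIES, 3.6033)

# versus starting from each movie's own average rating,
# assuming per-movie sums and counts have been precomputed:
# baseline = rating_sum / np.maximum(rating_count, 1)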

The best thing to do now is to generate my new results, open up my error analyzer and figure out what else to do :). It is quite possible that there simply isn't a better single / combined algorithm, other than blending 5,000 result sets together just to get higher on the leaderboard :).

Tuesday, March 31, 2009

Genetic neighbors are like k-medoids

Well, I've found some new information on the Internet whilst trying to refine my algorithms. I'm still in the process of lining up the right techniques to run a full prediction through. The next run still won't be optimal, but it should carry a couple of good improvements.

The technique I "discovered" actually already existed. That's very common, so no problem there. The application and context in which I'm using it, though, are very different.

In k-medoids, as in k-means, the trick is to select a centroid or "default" that acts as the center of a neighborhood. In k-means (and in the averaging that many knn approaches do), parameters are averaged out, so you develop an idea of a class of a certain kind that you either match or don't. In k-medoids, the trick is instead to find out which member is the best representative of the entire set (so no averaging there); this is done by evaluating the cost of swapping the center to another member.

The measure I use for this is the "silhouette" known from k-medoids. The silhouette tells you how similar a certain data point is to its default (in my case, its gene) compared to the alternatives: a(x) is the similarity of the data point to its currently allocated gene, b(x) is its similarity to the next most similar cluster. If there is a very good match with the current gene A but not with any other cluster, the silhouette tends towards 1. If another cluster is a better fit (which doesn't happen in my case), it tends towards -1, and 0 means indifference between A and B.
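
Written out with the similarity convention used here (the usual k-medoids silhouette is stated with distances, which flips the numerator to b(x) - a(x)):

s(x) = ( a(x) - b(x) ) / max( a(x), b(x) )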

Now, the clusters need to be initialized to some values. I'm still thinking about how to do this better, but it may not matter too much at the start; that's an optimization. After that, set the ratings of each gene to the exact training ratings of the "default" member that was selected. This probably works better when you select defaults with a high vote count.

Then, run an algorithm that determines similarity. As a dirty trick I'm using ((1/rmse)/sumRmse) for now, and it seems to work well. Record the similarity of each customer to each cluster (so we are classifying customers here for now); movies can likely be clustered in a similar way. In that case, one could determine biases between user clusters and movie clusters, or even run a separate matrix factorization between them as a kind of "global average". Possibly the MF values for each individual user / movie become smaller in that second layer, and hopefully, with the help of clustering, noise is reduced and the MF values for each individual user become more meaningful and less of an average.
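
A sketch of how that similarity could be computed for one user (reading "(1/rmse)/sumRmse" as inverse RMSE normalized so that the weights over all genes sum to one; the data structures here are made up for the example):

import math

def user_gene_similarities(user_ratings, genes):
    # user_ratings: dict movie_id -> rating for one user
    # genes: list of dicts movie_id -> the gene's (predicted) rating
    inverse = []
    for gene in genes:
        se, n = 0.0, 0
        for movie, r in user_ratings.items():
            if movie in gene:
                se += (r - gene[movie]) ** 2
                n += 1
        rmse = math.sqrt(se / n) if n else float('inf')
        inverse.append(1.0 / (rmse + 1e-6))   # small epsilon avoids division by zero
    total = sum(inverse) or 1.0
    return [x / total for x in inverse]       # normalized: sums to 1 over all genes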

What I'm actually doing after allocating users to clusters / genes is using that information to complete the matrix: not user X <-> movie Y as one big matrix, but gene X <-> movie Y. If the users in a gene really are more similar, then the error in the per-gene movie average should go down, provided the clusters are properly populated. That movie average is then used as a focal point for further calculations.

You obviously can't fill the gaps with plain "average" values any longer, because that information is already captured by the global movie average, which you could see as the unbiased average over a single gene: the entire population. So a new, external algorithm is needed to provide those numbers. Global effects could do for now, or perhaps a matrix factorization method.

The "silhouette" calculated earlier for each user then becomes useful. Since the silhouette measures how firmly a user sits inside a gene, we can set a threshold that decides which users actually contribute to the gene and which don't. Imagine the border between two clusters with a lot of users sitting on that "front": if you don't filter those out, they pull the clusters closer together, causing frequent swaps from one cluster to the other and making the whole thing less stable (I suppose; this isn't lab tested :).

In initial tests I used 10 clusters and that worked well; calculation time is then about 20 seconds per cycle. After 10 cycles the remaining differences are just noise, with no apparent useful change in rmse. This was before my thresholding algorithm.

With thresholding, the rmse drops quite quickly and then stays rather fixed after about 8 cycles. Whether that is good or bad I don't know; I probably need further tests to see what happens after 64 cycles... does it get worse, or better?

The genes are calculated with a rather direct formula; there is no learning involved there. If actual training data is available for a gene, it is averaged in, but only from users whose threshold within the gene is high enough (it measures how representative that user is for that gene). Choosing the threshold too high kills your changes and flexibility; too low and there's instability and continuous change.

A gene contains one float per movie, and those floats are "averages". If any real ratings are found for a movie within the gene, those are used. If no ratings are found for a movie, the customers in the cluster (those above the threshold) are used to make predictions, and those predictions are averaged.
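
As a sketch, completing one gene then looks something like this (the external predict() function, the threshold and the data layout are assumptions for the example):

def complete_gene(members, train_ratings, predict, num_movies, threshold):
    # members: list of (user_id, silhouette) pairs allocated to this gene
    # train_ratings: dict (user_id, movie_id) -> real rating from the training set
    # predict: external prediction function, predict(user_id, movie_id) -> float
    strong = [u for u, sil in members if sil >= threshold]
    gene = [0.0] * num_movies
    for movie in range(num_movies):
        real = [train_ratings[(u, movie)] for u in strong if (u, movie) in train_ratings]
        if real:                                      # average of actual ratings
            gene[movie] = sum(real) / len(real)
        elif strong:                                  # fall back to averaged predictions
            gene[movie] = sum(predict(u, movie) for u in strong) / len(strong)
    return gene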

Using this method with k=24 or so, I see the rmse going down to about 0.957677 after 10 cycles, but the improvements level off entirely after that. Increasing the number of clusters significantly increases computation time: k=10 is reasonable, k=32 is already pretty gruesome to wait for.

It is possible that the initialization of the clusters is horribly inaccurate and incomplete, and that it therefore needs a couple of training cycles to get to a state it is happy with. I'm not yet sure how the numbers move and progress in that case.

I'm now trying to get better results by combining this method with matrix factorization within specific groups and between groups. Thinking about it, if you split a very large data set into groups that are more similar to one another than to the global set, it makes sense to investigate a hierarchical ordering of movies and customers and to calculate between those groups as well. Care must be taken not to make the groups too small or too large, though: too large and a group isn't distinct enough, too small and the numbers don't build up the momentum they need to point in the right direction :).

Earlier I tried using the similarity rating to a cluster directly in the calculations, but it is woefully inaccurate; taking 20% of one cluster and 80% of another doesn't really work, the results are mediocre. The silhouette rating is much more telling. It is useful for a variety of purposes, since it measures how deeply a user is allocated to a certain gene.

Oh, and today my new memory arrived: a super duper 4GB, so I can run multiple instances next to one another. I don't have a quad-core yet, though, so effectively that means two. Beyond what I mentioned in this post, there are a multitude of other improvements I'm planning before my next submission.

Saturday, March 28, 2009

New algorithm: Genetic neighbors

I'm developing a new algorithm, which I'm just now testing, for use in the Netflix Prize. I call it "genetic neighbors".

The challenge in the Netflix Prize is to populate a very large matrix given only 1.5-2% of its values, so roughly 98-98.5% of the values need to be interpolated. Some people have resorted to Pearson correlation between users for k-nearest-neighbor calculations, which in my view is misguided, since it gives a relative correlation between two users that doesn't mean much outside that context. This approach instead attempts to create an absolute frame of reference for the correlation calculations.

A friend of mine, now in the U.S., pointed me to a link that mentioned something about genetic link calculation. The approach basically treats a set of movie ratings as a kind of gene and then tries to match users against it. The neat thing is that a user need not match 100%: a user can match one gene for 80% and another for 50%.

The match of a user also affects the formation of the gene itself: each gene mutates based on the input of different users, weighted by how well they match it. The idea is also inspired by incremental "knn", but works very differently, since it's not a relation between users but a similarity / clustering against an absolute term within the model (the gene). The gene is complete in the sense that it carries a prediction for every movie, so other users can be matched against it regardless of where they come from. With only very little information, a user can belong to a number of different genes with different weights, such that each gene contributes to the final prediction weighted by its correlation with the user's historical ratings. Mistakes will of course be made there, but as more information becomes available, the predictions should become more precise.
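
The final prediction for a user/movie pair is then just a weighted mixture over the genes; roughly (names made up for the example):

def predict_rating(gene_weights, genes, movie):
    # gene_weights: list of (gene_index, weight) pairs for this user, weights summing to 1
    # genes: per-gene prediction lists, indexed by movie
    return sum(w * genes[g][movie] for g, w in gene_weights)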

I'm also experimenting with how this algorithm learns. If a user doesn't have a strong profile (relatively few ratings), they should not contribute that much to a gene. Finding the right parameters is (yet again!) another challenge. I'm relying on gradient descent for learning the genes.

Another difference in this program is the organization of the rating information. Most programs probably keep a kind of replicated table in the application, in the form of a struct and one big array. In this model, however, the ratings are stored under the user structure, which greatly improves the speed of gene calculation and training, and also reduces memory requirements.
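
A sketch of what I mean by hanging the ratings off the user, instead of keeping one big flat ratings table (field names are illustrative):

from dataclasses import dataclass, field

@dataclass
class User:
    user_id: int
    movie_ids: list = field(default_factory=list)  # movies this user has rated
    ratings: list = field(default_factory=list)    # parallel list of ratings (1-5)

# gene calculation then walks each user's own ratings directly:
# for user in users:
#     for movie, r in zip(user.movie_ids, user.ratings):
#         ...update the user's gene match with (movie, r)...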

Doing this cluster analysis, it's also clear that users don't fit into a single cluster that easily; they share parts of one gene with parts of another. Setting up a gene for each individual user would be ideal, but then there wouldn't be any reuse of cluster-specific knowledge from other users.

Thursday, March 19, 2009

More netflix...

In tonight's calculation run I ran the entire algorithm with a couple of different settings. I did not manage to significantly improve my ranking, but I did get closer to the probe score by training over the entire data set, including the probe. I'm now researching at which point overfitting starts to occur. I reckon it is related to the learning rate, and that it sets in after roughly ( 50 cycles + current-feature-num ).
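
In code that heuristic is trivial; a sketch (train_feature() and num_features are assumed to exist elsewhere):

def max_epochs_for_feature(feature_num, base=50):
    # overfitting seems to set in around (50 cycles + current feature number),
    # so cap the epochs per feature instead of only watching the probe rmse
    return base + feature_num

# for f in range(num_features):
#     for epoch in range(max_epochs_for_feature(f)):
#         train_feature(f)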

If you only look at probe rmse improvements to decide when to switch to the next feature, this may never become apparent. I've run a wide range of configurations, always using the probe rmse as the early-stopping signal (stop once it stops improving), but I now reckon that this stops too late. I'm looking instead at stopping once the hill has been reached, using a simple calculation like the one above. I'll probably post those results in the future.

I've noticed that as training progresses, the point of highest improvement moves further out and you need more epochs to get there. The first feature shows improvement immediately after the first epoch, but the 10th, for example, only shows improvement after 16-20 epochs, and also "stops" 16-20 epochs later than the first.

I've tried squeezing out as much rmse improvement as possible with different parameters and global-effect reductions, but the result is a flat line with no further improvement after about 32 features. Applying the movie/user bias as a global effect actually worsens the result significantly, so I've turned that off; for the moment I'm relying on training the biases together with the features instead.

Using K = 0.03f or K = 0.01f also makes no difference in the long run. The weird thing is that the probe rmse for the first feature isn't really affected by it at all, and in the long run I keep seeing the same numbers come back.

I've also tried gradually modifying K and lrate, and that didn't have any effect either. Note that in all these experiments, the probe rmse is the deciding factor for whether to continue or not. I used to rely on a fixed MAX_EPOCH setting for switching to the next feature, and there I did see differences. The following sketch is instrumental in my explanation:


The graph shows, more or less, how the improvements build up and decay. The x-axis shows the epochs, the y-axis the probe rmse improvement. The vertical dashed line is the cut-off point where training stops. It shows that as long as training is stopped at the cut-off point, the exact parameters are fairly irrelevant, since the area under the curve doesn't change much. I should have drawn the smaller bell curve taller, so that the surface area under it makes clear that both curves have the same total gain (which in my current setup is the case).

A higher learning rate squeezes the bell curve together (allowing earlier stopping), but it also makes the total result less accurate (since overshoot may occur). K has a similar effect, in that it holds back (regularizes) the learning. But overall it should not matter too much.

So my approach now is to find out at which point before the cut-off it is best to stop learning and move on to the next feature. My guess is that overfitting starts right after training has passed the top of the hill. By aligning my parameters, I'm going to try to put that hill at a predictable point, so that I can simply set a maximum number of epochs and then move on to the next feature.