Tuesday, March 31, 2009

Genetic neighbors are like k-medoids

Well, I've found some new info on the Internet whilst trying to refine my algorithms here. I'm still in the process of lining up the proper techniques to run a prediction through. The next one still won't be optimal, but it should carry a couple of good improvements.

The technique I "discovered" actually already existed. That's very common, so no problem there. The application and context in which I'm using it though is very different.

In k-medoids and kNN, the trick is to select a centroid or "default" that acts as the center of a determined neighborhood. In kNN, parameters are generally averaged out, building up an idea of a class of a certain kind that you either match or not. In k-medoids, the trick is to find out which member is the best representative of the entire set (so no averaging there). This is done by working out the cost of swapping the center to another member.
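To make that swap-cost idea concrete, here is a minimal sketch in Python (names like dissim and members are mine, not from my actual code): the cluster cost is the total dissimilarity of the members to the medoid, and a swap is worth it when it lowers that cost.

    import numpy as np

    def best_medoid_swap(dissim, members, current_medoid):
        # dissim: full dissimilarity matrix; members: indices of this cluster
        costs = {m: dissim[members, m].sum() for m in members}
        best = min(costs, key=costs.get)
        saving = costs[current_medoid] - costs[best]  # > 0 means the swap helps
        return best, saving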

The measure for similarity I'm using here is known as the "silhouette" in k-medoids. The silhouette tells you how similar a certain data point is to its default (in my case, the gene). a(x) is the similarity of the data point to the currently allocated gene, b(x) is the similarity of the data point to the next-in-line most similar cluster. So, if there is a very good match with A, but a not-so-good match with any other cluster, the silhouette tends towards 1. If the other cluster is a better fit (doesn't happen in my case), it tends towards -1. 0 means indifference between A and B.
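Written out (this is the usual k-medoids silhouette, but with a(x) and b(x) as similarities rather than distances, so the numerator is flipped compared to the textbook version):

    s(x) = \frac{a(x) - b(x)}{\max\{a(x),\, b(x)\}}, \qquad -1 \le s(x) \le 1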

Now, the trick is to initialize the clusters to some values. I'm still considering how to do this better, but I'm also considering that it may not matter too much in the beginning; it's an optimization. Then, set the ratings for each gene to the exact ratings from the training data of the "default" member that was selected. This probably works better when you select defaults that have a high vote count.
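A minimal sketch of that initialization, assuming a plain ratings dict of user -> {movie: rating} and picking the highest-vote users as defaults (the names are mine, not from the real code):

    import numpy as np

    def init_genes(ratings, num_genes, num_movies):
        # pick the users with the most votes as the "default" member of each gene
        defaults = sorted(ratings, key=lambda u: len(ratings[u]), reverse=True)[:num_genes]
        genes = np.full((num_genes, num_movies), np.nan)   # NaN = no value yet
        for g, user in enumerate(defaults):
            for movie, rating in ratings[user].items():
                genes[g, movie] = float(rating)            # copy the exact training ratings
        return genes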

Then, run an algorithm that determines similarity. I'm using ((1/rmse)/sumRmse) as a dirty trick for now, and it seems to work well. Then record the similarity of each customer to each cluster (so we are classifying customers here for now). Movies can likely be clustered in similar ways, and in that case one could determine biases between user and movie clusters, or even use a separate matrix factorization module to run SVD between those as a "global average". Possibly the MF values for each user / movie become smaller in that second layer and hopefully, with the help of clustering, noise is reduced and the MF values for each individual user become more meaningful and less of an average.
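A rough sketch of that similarity step, under the same assumptions as the snippet above (the (1/rmse)/sumRmse normalization is exactly the dirty trick mentioned here):

    import numpy as np

    def gene_similarities(user_ratings, genes):
        # user_ratings: {movie: rating}; genes: gene x movie matrix with NaN gaps
        inv_rmse = np.zeros(len(genes))
        for g, gene in enumerate(genes):
            common = [m for m in user_ratings if not np.isnan(gene[m])]
            if not common:
                continue
            errors = [user_ratings[m] - gene[m] for m in common]
            rmse = np.sqrt(np.mean(np.square(errors)))
            inv_rmse[g] = 1.0 / max(rmse, 1e-6)    # guard against a perfect match
        return inv_rmse / inv_rmse.sum()           # (1/rmse) / sumRmse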

What I'm actually doing after allocating users into clusters / genes is using that information to complete the matrix: not user X <-> movie Y as one big matrix, but gene X <-> movie Y. If the users within a gene really are more similar, then the error in the movie average should go down, provided the clusters are properly populated. That movie average is then used as a focal point for further calculations.

You obviously can't use "average" values any longer, because that role is already taken by the global movie average, which is unbiased and computed over a single gene, you could say: the entire population. Thus, we need a new, external algorithm that is able to provide those numbers. Global effects could do for now, or perhaps a matrix factorization method.
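In other words (hypothetical helper names, just to show where that external algorithm plugs in): the gene-level movie average is the base, and global effects or matrix factorization supply the per-user offset on top of it.

    def predict(user, movie, gene_of, genes, residual_model):
        base = genes[gene_of[user], movie]          # focal point: gene x movie average
        return base + residual_model(user, movie)   # residual trained on (rating - base)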

The "silhouette" calculated earlier for each user then becomes useful. If the silhouette is a measure for how "powerful" a certain user falls into a gene, then we can set a threshold which users actually contribute to that gene and which don't. If you imagine a borderline between two clusters and a lot of users on that "front", then if one doesn't filter those out, those users would pull the clusters closer together, causing frequent swaps from one cluster to the other, making it less stable (I suppose, this isn't lab tested :).

In initial tests, I came up with 10 clusters and that worked well. Calculation time is then about 20 secs or so per cycle. After 10 cycles there's just noise generating differences, but no apparent useful changes in rmse. This was before my thresholding algorithm.

After thresholding, the rmse drops quite quickly and then stays rather fixed after about 8 cycles. This may be good or bad, I don't know. I probably need some further tests to see what happens after 64 cycles... Does it then become worse? Better?

The genes are calculated with a rather discrete formula; there is no learning involved there. If there is actual training data available for a gene, that is used for averaging out, but only if the threshold for that user in the gene is high enough (it's a measure of how representative that user is for that gene). Choosing an incorrect threshold hurts: too high and it kills your changes and flexibility, too low and there's instability and continuous change.

The gene contains floats, these floats are "averages", and there are as many floats as there are movies. If any real ratings were found for a movie in the gene, those are used. If no ratings were found for a movie, then the customers within that cluster (those with a high enough threshold) are used to make predictions, and those predictions are averaged.
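Putting the last two paragraphs together, a sketch of how one gene's floats could be rebuilt each cycle (assuming a ratings dict, a silhouette dict and some predict(user, movie) function; all names are mine):

    import numpy as np

    def rebuild_gene(num_movies, members, ratings, silhouette, predict, threshold=0.1):
        core = [u for u in members if silhouette[u] >= threshold]   # well-allocated users only
        gene = np.full(num_movies, np.nan)
        for movie in range(num_movies):
            real = [ratings[u][movie] for u in core if movie in ratings[u]]
            if real:                       # real training ratings win ...
                gene[movie] = np.mean(real)
            elif core:                     # ... otherwise average the members' predictions
                gene[movie] = np.mean([predict(u, movie) for u in core])
        return gene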

Using this method and selecting k=24 or so, I see the rmse going down to 0.957677 or so after 10 cycles, but the changes and improvements quickly level off after that. Increasing the number of clusters significantly increases computation time: k=10 is reasonable, k=32 is already pretty gruesome to wait for.

There is a possibility that the initialization of the clusters is horribly inaccurate and incomplete, so it may need a couple of training cycles to get to a state it is happy with. I'm not yet sure about the movement and progress of the numbers in this case.

I'm now trying to get better results by combining this method with matrix factorization within specific groups and between groups. Thinking about it: if you split a very large result set into groups that are more similar to one another than to the global set, it makes sense to investigate a more hierarchical ordering of movies and customers and to calculate between those groups as well. However, care must be taken not to make the groups too small or too large. Too large and the group isn't distinct enough; too small and it doesn't generate the momentum the numbers need to point in the right direction :).

Before, I tried to apply the similarity rating to a certain cluster directly in the calculations, but it is woefully inaccurate. It's not really any use taking 20% of one cluster and 80% of another; the results are mediocre. What is much more telling is the silhouette rating. That rating is useful for a variety of purposes, since it is a measure of how deeply allocated a user is in a certain gene.

Oh... and today my new memory arrived. Super duper 4GB to run multiple instances next to one another. Although I don't have a quad yet, so it can only effectively be 2. And beyond what I mentioned in this post, there are a multitude of other improvements I am planning before my next submission.
