## Saturday, March 28, 2009

### New algorithm: Genetic neighbors

I'm developing a new algorithm that I'm just starting to test. I call it "genetic neighbors", and it's for use in the Netflix Prize.

The challenge in the Netflix Prize is to populate a very large matrix based on only the 1.5-2% of its values that are given. Thus, about 98-98.5% of the values need to be interpolated. Some people have resorted to Pearson correlation between users for k-nearest-neighbor calculation, which in my view is misguided, since it gives a relative correlation between two users that does not mean much outside that context. This approach attempts to create an absolute frame of reference for correlation calculations.
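For context, a minimal sketch of the Pearson-between-users approach I'm arguing against: the correlation is only defined over the movies two users have both rated, which is exactly why it is relative to that pair. All names here are illustrative.

```python
def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users rated.

    ratings_a, ratings_b: dicts mapping movie_id -> rating.
    """
    common = list(ratings_a.keys() & ratings_b.keys())
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to say anything
    xs = [ratings_a[m] for m in common]
    ys = [ratings_b[m] for m in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat rating profile carries no signal
    return cov / (var_x * var_y) ** 0.5
```

Note that two users with only two movies in common can still score a "perfect" 1.0 correlation, which illustrates how little the number means outside that pair's context.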

A friend of mine, now in the U.S., pointed me to a link that mentioned something about genetic link calculation. This approach basically treats a set of movie ratings as a type of gene and then tries to match users against it. The neat thing is that a user need not match 100%: a user can match 80% with one gene and 50% with another.
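The partial-match idea can be sketched as follows: a gene holds a prediction for every movie it covers, and a user's match with a gene is a soft score in [0, 1] computed over only the movies that user has rated. The scoring formula here (one minus the normalized mean absolute difference) is my own illustrative choice, not necessarily the exact one used.

```python
RATING_MIN, RATING_MAX = 1.0, 5.0  # Netflix ratings run 1-5

def match(user_ratings, gene):
    """Soft match between a user and a gene, in [0, 1].

    user_ratings: dict movie_id -> rating (only what the user rated).
    gene: dict movie_id -> predicted rating (complete for its movies).
    """
    overlap = user_ratings.keys() & gene.keys()
    if not overlap:
        return 0.0
    max_err = RATING_MAX - RATING_MIN
    mae = sum(abs(user_ratings[m] - gene[m]) for m in overlap) / len(overlap)
    return 1.0 - mae / max_err
```

A user who agrees with one gene on most overlapping movies and only loosely with another would get, say, 0.8 and 0.5 under this kind of scheme.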

The user's match also has an effect on the formation of the gene: each gene mutates based on the input of different users, in proportion to their match with it. The idea is also inspired by incremental kNN, but works very differently, since it is not a relation between users but a similarity / clustering against an absolute term within this model (the gene). The gene is complete in the sense that it has a prediction for every movie within it, so any user can be matched against it regardless of which movies they have rated. With only very little information, a user can belong to a number of different genes with different weights, such that each gene contributes to the final prediction weighted by its correlation with the user based on historical information. Mistakes will of course be made there, but as more information becomes available, predictions are expected to become more precise.
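The final prediction step described above can be sketched like this: each gene contributes its stored prediction for the movie, weighted by the user's historical match with that gene. The match weights are taken as given here (computed elsewhere), and the fallback value is an illustrative guess.

```python
def predict(genes, match_weights, movie_id, fallback=3.0):
    """Match-weighted average of gene predictions for one movie.

    genes: list of dicts, each mapping movie_id -> predicted rating.
    match_weights: this user's match per gene, same order, in [0, 1].
    """
    weighted = total = 0.0
    for gene, w in zip(genes, match_weights):
        if movie_id in gene and w > 0.0:
            weighted += w * gene[movie_id]
            total += w
    # with no matching gene, fall back to a global default rating
    return weighted / total if total > 0.0 else fallback
```

A gene the user matches at 0.8 thus pulls the prediction toward its value four times as strongly as one matched at 0.2.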

I'm also experimenting with how this algorithm learns. If a user doesn't have a strong profile (relatively few ratings), then that user should not contribute much to a gene. Finding the correct parameters is (yet again!) another challenge. I'm relying on gradient descent for learning the genes.
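A gradient-descent update for a gene along these lines might look as follows, with two dampers: the user's match with the gene, and a profile-strength factor that shrinks the step for users with few ratings. The learning rate and the strength formula are illustrative guesses, not the actual parameters.

```python
def update_gene(gene, user_ratings, match_w, lr=0.05):
    """Nudge a gene's predictions toward one user's ratings.

    gene: dict movie_id -> predicted rating (mutated in place).
    user_ratings: dict movie_id -> rating for this user.
    match_w: this user's match with the gene, in [0, 1].
    """
    # users with few ratings get a strength near 0, heavy raters near 1
    n = len(user_ratings)
    strength = n / (n + 25.0)
    for movie, r in user_ratings.items():
        if movie in gene:
            err = r - gene[movie]  # prediction error on this movie
            gene[movie] += lr * match_w * strength * err
```

The damping means a single-rating user barely moves a gene, while a well-matched heavy rater reshapes it noticeably over repeated passes.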

Another difference in this program is the organization of the rating information. Most programs probably keep a replicated copy of the ratings table in the application, as a struct and an array. In this model, however, the ratings are allocated under the user structure, which greatly improves the speed of gene calculation and genetic training, and also reduces memory requirements.
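In Python terms, the per-user layout could be sketched like this: instead of one global (user, movie, rating) table, each user owns compact arrays of their own ratings, so scanning a user during gene training touches one small block of memory. Field names are illustrative; the real implementation presumably uses C structs.

```python
from array import array
from dataclasses import dataclass, field

@dataclass
class User:
    user_id: int
    # parallel arrays: movie_ids[i] was rated ratings[i] by this user
    movie_ids: array = field(default_factory=lambda: array("i"))
    ratings: array = field(default_factory=lambda: array("b"))

    def add_rating(self, movie_id, rating):
        self.movie_ids.append(movie_id)
        self.ratings.append(rating)
```

Grouping by user also means memory scales with the number of actual ratings, with no per-cell storage for the ~98% of the matrix that is empty.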

Doing this cluster analysis, it's also clear that users don't really fit neatly into a single cluster. They share parts of one gene and parts of another. Setting up a gene for each individual user would be ideal, but then there wouldn't be any re-use of cluster-specific knowledge from other users.