Tuesday, March 31, 2009

Genetic neighbors are like k-medoids

Well, I've found some new info on the Internet whilst trying to refine my algorithms here. I'm still in the process of aligning the proper techniques to be able to run a prediction through. The next one isn't optimal still, but should carry along a couple of good improvements.

The technique I "discovered" actually already existed. That's very common, so no problem there. The application and context in which I'm using it though is very different.

In k-medoids and knn, the trick is to select a centroid or "default" that will act as the center of a determined neighborhood. In KNN, generally parameters are averaged out, as such developing an idea of a class of a certain kind that you either match or not. In k-medoids, the trick is to find out which member is the best representative of the entire set (so no averaging there). This is done by finding out the cost of swapping the center to another member.

This thing above is a measure for similarity, otherwise known as "silhouette" in k-medoids. The silhouette provides information how similar a certain data point is in comparison to its default (in my case, the gene). a(x) is the similarity of the data point to the currently allocated gene, b(x) is the similarity of the data point to the next-in-line highest similar cluster. Thus, if there is a very good match with A, but a not so good match with any other cluster, then the silhouette will tend towards 1. If the other cluster is better (doesn't happen in my case), it tends towards -1. 0 is indifferent between A and B.

Now, the trick is to initialize the clusters to some values. I'm still considering how to do this to a better ability, but I'm also considering it may not matter too much in the beginning. It's optimization. Then, set the ratings for each gene to the exact ratings of the training data of the "default" member that was selected. This probably works better when you select defaults that have a high vote count.

Then, run an algorithm that determines similarity. I'm using ((1/rmse)/sumRmse) as a dirty trick for now. It seems to work well. Then record the similarity for each cluster on each customer (so we are classifying customers here for now). Movies can also be clustered in similar ways, likely. And in that case, one could determine biases between user and movie clusters or even use a different matrix factorization module to svd between those as a "global average". Possibly the mf-values for each user / movie become lower in the second layer and hopefully, through the help of clustering, noise is reduced and the mf values for each individual user become more meaningful and less of an average.

What I'm actually doing after allocating users into clusters / genes, is that I'm using that information to complete the matrix. Not user X <-> movie Y as one big matrix, but gene X <-> movie Y. If the users are really more similar, then the error in movie average hopefully goes down if the clusters are properly populated. The movie average is then used as a focal point for further calculations.

You obviously can't use "average" values any longer, because that's already set by the global movie average, unbiased and over a single gene you could say, the entire population. Thus, we need a new, external algorithm that is able to provide those numbers. Global effects could do for now, or perhaps a matrix factorization method.

The "silhouette" calculated earlier for each user then becomes useful. If the silhouette is a measure for how "powerful" a certain user falls into a gene, then we can set a threshold which users actually contribute to that gene and which don't. If you imagine a borderline between two clusters and a lot of users on that "front", then if one doesn't filter those out, those users would pull the clusters closer together, causing frequent swaps from one cluster to the other, making it less stable (I suppose, this isn't lab tested :).

In initial tests, I came up with 10 clusters and that worked well. Calculation time then is about 20 secs or so per cycle. After 10 cycles, there's just noise generating differences, but no apparent useful changes in rmse. This was before my thresholding algorithm.

After thresholding, the rmse drops quite quickly and also stays rather fixed after about 8 cycles. This may be good or bad, I don't know. I probably need some further tests to see what happens after 64 cycles... Does it then become worse? better?

The genes are calculated with a rather discrete formula, there is no learning involved there. If there is actual training data available for a gene, that is used for averaging out, but only if the threshold for that user in the gene is high enough (it's a measure of representation of that user for that gene). Choosing an incorrect threshold kills your changes and flexbility, too low and there's instability and continuous change.

The gene contains floats, and these floats are "averages", and there are as many floats as there are movies. If any real rates were found for a movie in the gene, then those are used. If no ratings were found for a movie, then the customers within that cluster are used to make predictions (if they have high enough threshold), and those predictions are averaged.

Using this method and selecting k=24 or so, I see the rmse going down to 0.957677 or so after 10 cycles. But the changes and improvements quickly become entirely level after that. Increasing the clusters significantly increases computation time. k=10 is reasonable, k=32 is already pretty gruesome to wait for.

There is a possibility that the initiation of the clusters is horribly inaccurate and incomplete. So it's possible it needs a couple of training cycles to get to a state that it is happy with. I'm not yet sure about the movement and progress of the numbers in this case.

I'm now trying to get better results by combining this method with matrix factorization in specific groups and between groups. Thinking about it... splitting up very large results into groups that are better similar towards another as compared to the global set, it makes sense to investigate more a hierarchical ordering of movies and customers and calculate between those groups as well. However, care must be taken not to make the groups too small or too large. Too large doesn't make the group distinct enough, too small doesn't generate the momentum that the numbers need to point in the right direction :).

Before, I tried to apply the similarity rating to a certain cluster in calculations, but it is woefully inaccurate. It's not really any use taking 20% of one cluster and 80% of another, the results are mediocre. What is much more telling however is the silhouette rating. That rating is much more useful for a variety of purposes, since it is a measure how deeply allocated a user is in a certain gene.

Oh... and today my new memory arrived. Super duper 4GB to run multiple instances next to one another. Although I don't have a quad yet, so it can only effectively be 2. And beyond what I mentioned in this post, there are a multitude of other improvements I am planning before my next submission.

Saturday, March 28, 2009

New algorithm: Genetic neighbors

I'm developing a new algorithm that I'm just testing now. I call it "genetic neighbors". It's for use in the netflix prize.

The challenge in the netflix prize is to populate a very large matrix based on 1.5-2% of given values in that matrix. Thus, about 98-98.5% of values need to be interpolated. Some people have resorted to the use of pearson correlation between users for k-nearest neighbor calculation (which in my view is misguided, since it gives a relative correlation between two users, which outside that context does not mean much). This approach attempts to create an absolute ground of reference for correlation calculations.

I was pointed to a link by a friend of mine from (now) the U.S., where it mentioned something about genetic link calculation. This approach basically assumes a set of movie ratings as a type of gene and then tries to match users on that. The neat thing is also that a user need not match a 100%, a user can match 80% with one gene and 50% with another.

The match of the user also has an effect on the formation of that gene, thus, each gene mutates based on the input of different users, according to their match with that gene. The idea is also inspired by incremental "knn", but works very differently, since it's not a relation between users, but a similarity / clustering on an absolute term within this model (the gene). The gene is considered complete in the sense that it has a prediction for each movie within the gene, it is possible to match other users against it wherever they are from. With only very little information, a user can belong to a number of different genes with different weights, such that each gene contributes to the final prediction weighted by its correlation with the user based on historical information. Mistakes will of course be made there, but the more information becomes available, it is expected that predictions become more precise.

Also, I'm experimenting with how this algorithm learns. If a user doesn't have a strong profile (relatively little ratings), then it should not contribute that much to a gene. It's also another challenge (yet again!) to find the correct parameters. I'm also relying on gradient descent for learning the genes.

The difference in this program is also the organization of the rating information. Most programs probably use a type of replicated table in the application in the form of a struct and an array. In this model however, the ratings are allocated under the user structure, which greatly improves the speed in gene calculation and genetic training and also reduces memory requirements.

Doing this cluster analysis, it's also clear to see that the users don't really fit that easily into a cluster. They're sharing some parts of one gene with parts of another gene. Setting up a gene for each individual user would be ideal, but then there wouldn't be any re-use of cluster-specific knowledge of other users.

Thursday, March 26, 2009

Something to watch: FACETS

FACETS is probably one of the most interesting research projects that's taking place right now. Well, together with the LHC :). Previously, I posted on the architecture of mind. That post resembles the envisioning of a totally different kind of hardware as it would apply to (re)modeling the human brain. Well, the people at heidelberg uni are now doing this. They're constructing hardware that uses a radically different design than micro-processors. From their site:
By creating specialized digital hardware processors it might be possible to gain an advantage over microprocessor-based systems. Still, it is unlikely that this will be more than an order of magnitude, since they are based on the same technology as microprocessors: The neural circuits remain to be realised with numerical solutions of differential equations. The biggest problem lies in the fundamentals of Moore's law itself: the scaling of process technology. In the current semiconductor roadmap the progress has already slowed down. The transistor density of high-performance microprocessors is likely to increase only by a factor of 25 from 2004 to 2018. A power consumption of 300 Watts is predicted for such a hypothetical chip while the on-chip operating frequency will be in the 50 GHz range.
I've blogged before on consciousness and how this might relate to computers, and what a design could look like. And some more. And more.

Changing the hardware is extremely important and much more likely to become successful.

A large problem is still the coordination between things and the problems related to associative memory. That is, when we say "cat", we instantly recall associations related to the word, strongest first (black? kitten? dead mouse?).

Most explicit knowledge systems scan their entire memory base, or have otherwise explicitly defined boundaries around knowledge to hierarchically exclude certain pools of knowledge programmatically. Thus, a key pointing to some piece of information is not recognized as such and doesn't cause a specific part of memory to highlight. It's requiring a sweep of memory to see with which asset of knowledge it's associated.

In order to be successful in the future, I think it's necessary to find ways to prevent that, to directly find some location/pool/hierarchy where something is probably located, such that it finds a match or has the ability to locate it together with other like members.

Saturday, March 21, 2009

Early morning

early morning
Originally uploaded by gtoonstra
Last week I purchased a digital SLR camera for starting to make some proper photo's. I got up early today to see how I'd do, how the camera performs and try to make some good shots. Here's one of my favs, even though the sky is horribly overlit. It's a little bit eery, somewhat mysterious. Sharpness is quite ok, could have been better with a tripod. Good division of fore, middle and background and everything.

Thursday, March 19, 2009

More netflix...

In the calculation run tonight, I ran the entire algorithm with a couple of different factors. I did not manage to significantly improve my ranking, but I did manage to get closer to the probe rating by training over the entire data-set, including the probe. I'm now researching at which point data overfitting starts to occur. I reckon it is related to the learning rate and I reckon it is about after ( 50 cycles + current-feature-num ).

If you only react to the probe rmse improvements to determine to switch to the next feature, this may never become apparent. I've run a wide number of different configurations, always using probe rmse as early stopping method (once it stops improving), but now I reckon that it may be too late. I'm now looking at early stopping once the hill has been reached, but using a simple calculation (as above). I'll probably post about those results in the future.

I've noticed that as time progresses, the point of highest improvement moves backwards and you need more epochs to get there. Thus, the first feature shows improvement immediately after the first epoch, but the 10th for example only shows improvement after 16-20 epochs, also "stopping" 16-20 epochs later than the first.

I've tried squeezing out as much rmse improvement as possible by using different parameters and global effect reductions, but this results in a flat line of no more improvement after about 32 features. Actually, applying the movie/user bias global effect worsens the result significantly, so I've turned that off. Instead, I'm relying on the bias training together with the features for the moment.

Using K = 0.03f or K = 0.01f also doesn't have any effect in the long run. Actually, the weird thing is that the probe rmse for the first feature isn't really affected by it at all. Also, in the long run I'm seeing the same numbers come back.

I've also tried gradually modifying K and lrate and neither did that have any effect. Note that in all these experiments, the probe rmse is the leading thing to determine to continue or not. I used to rely on a standard MAX_EPOCH setting that switched to the next feature, where I did get differences. The following sketch is instrumental in my explanation:

The graph shows how more or less the improvements build up and decay. The x-axis shows the epochs, the y is the probe rmse improvement. The vertical dashed line is the cut-off point where training stops. Thus, it shows that whenever training is stopped at the cut-off point, it is irrelevant what the parameters are, since the area doesn't seem to change much. I should have drawn the smaller bell-curve higher, so that the surface area under it makes it clear they have the same gain in total (which in my setup now is the case).

The learning rate squeezes the bell-curve together (allowing earlier stopping), but it also makes the total result less accurate (since overshoot may occur). The K has a similar effect to learning as it holds back (regularizes) the learning. But overall it should not matter too much.

So my approach now is to find out at which point before the cut-off it is best to stop learning and move on to the next feature. My guess is that over-fitting starts occurring right after the training has passed the hill. By aligning my parameters, I'm going to try to put that hill on a predictable point, such that I can simply set a maximum epoch and then move on to the next.

Tuesday, March 17, 2009

Al go rhythms

Mr. Larry Freeman, see link, is a software engineer from Fremont, who has taken the time to explain the "global effects removal" technique sometimes used by contenders in the Netflix prize and reflects on a number of mathematical issues. It's quite interesting for non-academic people, since it's bite-sized pieces.

I'm already grateful for explaining and demonstrating the global effect removal technique. I've tried putting it into my implementation, but sadly could not significantly improve my ratings. As you may guess, I'm using RSVD together with bias determination รก la Funk and Paterek. I'm not using any blending whatsoever.

One of the reasons it doesn't work so well is probably that my implementation already covers biases by training biases along with svd factors in the first phase. Thus, the benefit of it is greatly reduced. I do notice however that using the technique requires my factors to be severely changed. I had lambda=0.05f as suggested by Paterek for calculating the biases, but now that these have been factored out, I reckon they have to be a lot higher to prevent over-fitting.

From the first 3 iterations in the svd algorithm, I can tell how things will turn out towards the end, unless my parameters are really wrong. I've seen a very good first rate at 0.951 (lower than the netflix start average ), which only ended up in a disappointing end rate of 0.9120 (higher than my ultimate post of 0.9114).

Thing that's interesting to research is to factor in global effects into the training phase. I imagine that I'm getting better results with biases, as opposed to global effects pre-processing. That's probably because all global effects are global averages, whereas some kind of gradient descent training may be able to discern the real features and the way they influence (or seem to influence) specific people. Those features may be different from the ones identified for global features, but one thing I'd like to definitely do is use the date more. Consider ratings made on a certain day, find out how much they differed and then make predictions on the change of the date.

This is not related to the production date of the movie versus the date of rating (I think that's very difficult and probably there's no causal relationship between them). One of the tricks in this netflix competition is to find causal relationships, as they make huge differences in the outcome. (in another way, you could say that if a business rule could be developed based on a certain observation that's mostly true, then that's great discovery!).

Wednesday, March 11, 2009

Improving the position on the leaderboard (up to 884 from 1138)

I've been busy with some algorithm refinements after reading papers on the netflix prize. I've gotten some very gradual results, but tomorrow should come up with some very good improvements after taking out a serious bug :).

It's quite a difficult problem to understand, but with more runs and perceptions of how the runs and data behaves, you get a feel for what to do. My advice to new starters is not to tie down too much to your algorithms. Fiddling with parameters is not going to significantly improve your results. Find ways to understand the data, the crux of the problem is that your algorithm is using a very sparse matrix as a basis to predict values in those cells where no value is present. Since the matrix is only 1.5% or so full, that's not a lot of data to start from.

I'm measuring performance mostly against the probe, not against the test set. The test set is just used to train the features.

So, the difficulty of applying SVD to the netflix prize is that the matrix is extremely sparse, generating loads of inaccuracies, and also that through having such a little amount of values available, the elasticity between values doesn't develop. With elasticity, I mean that you can perceive the original matrix row by row, as similarities between movies, but also column by column, looking at users. In a full matrix, the SVD algorithm develops a much closer prediction ability, because the values allow it to approximate whatever function it is that the matrix would embody. So, one movie that has high ratings for many users could at some point drop significantly for some other "type" of users. The problem is that there's not that many measuring points to find that out.

The harder predictions are those that fall far from the average that the algorithm generally would predict. But since those perceived values are rare, the features do not really embody such strong divergence.

I actually found that the key to solving everything without the use of blending is to do more work up-front, to better approximate the actual function, such that SVD can fill in the rest. The earlier you descend down to the real values (the rmse decreases), the better it is overall.

I've also seen posts from other people where they seem to be able to leave the SVD algorithms run for 1,000 features. In my implementation, I clearly see that there's a point where it develops divergence, so the rmse increases. In my original implementation that was around 32 features, right now it's about 42 features. I've decreased learning rate and improved the parameters and surely I'll post those once I see significant improvements.

Overall, if you're not significantly changing the heart of your algorithm and do not use ridiculous parameters, this point where divergence occurs is more or less the same. But also the improvements in rmse from one feature to the other remains more or less equal. Thus, if you are seeing a pattern of 0.176155, 0.007401, 0.004668 as improvements, if your rmse is lower for another round because you improved parameters or did more work, I've found that the descent is more or less the same, thus reaching an rmse that is as low as the difference in improvement of the first feature.

Using wilder parameters (much lower learning rate or stronger regularization) can make things look really good from the start, but divergence or instability is much more likely to occur after 15-16 features or so. Thus, I'm not tweaking those parameters that much anymore, as the strength of SVD isn't so much in the parameters, it's in how things are learned and what you do before/after. The differences for good values of lrate and k are only relatively marginal.

The learning rate only affects how fast the algorithm learns. But slightly lower rates allow your rmse to descend deeper, because it realizes a better approximation of the function. I have found though that changing K throughout the svd run has some positive results. Apparently, data easily overfits and increasing the value of K gave me some good results up to feature 14, yielding instability afterwards, but that was probably due to hitting the boundaries of its effectiveness.

I've calculated that, if performance really is linear with regards to the starting rmse, I'll need to start with an rmse of around 0.8957 in order to hit the 10% mark. So, by just looking at the first feature, and assuming that this performance is linear, I can easily see the results of my changes to the algorithm.

(EDIT: See results. Still rmse was lowering, so will reproduce another run with some other small tweaks and a larger change). More tomorrow!

Monday, March 02, 2009

Chasing the Netflix prize

As I'm naturally interested in algorithms, especially when these are related to Machine Learning or Artificial Intelligence... I'm posting some results here from my research into the Netflix prize so far. My results so far are not very impressive, but I'm gaining lots of new insights into the behaviour of Machine Learning algorithms and the relations between parameters, what they do. You can see on the left an attempt at lowering the Root Mean Square Error, or RMSE. That is the key feature to reduce in the Netflix prize. RMSE is calculated as follows:
  1. Predict the number that a user would rate a certain movie with one decimal accuracy.
  2. Compare the predicted number with the real rating that we know (training data).
  3. Multiply the error with itself (square) and add that to a sum.
  4. Divide the sum of the squared errors by the number of predictions made.
  5. Take the square root of that sum.
Netflix initially had a performance of 0.9514 rmse. Their challenge is to improve this by 10%, thus 0.8563 rmse.

In the picture above, you can see a blue line, which is a very potent reduction of rmse on the training set. It's actually a perfect logarithmic regression, and it follows that line perfectly. However (and I don't have sufficient probe points in red in the graph to demonstrate this), the performance on a set of data that is unknown (a probe to test against or the qualifying set) is gradually decreasing. Meaning, for data that is unknown to the algorithm (the predictions to be made) are getting worse, much worse. That means that the rmse for the probe and qualifying set is getting larger as the algorithm progresses. Thus, the training set is overfitting the parameters very heavily, yielding negative returns for the other data sets.

Looking at the projection of the red line, the rmse of the separated probe (known ratings, but excluded from the rating set) is not converging at some point. My conclusion is that I need to think of something else.

In some of my later iterations that I'm not showing here, it's interesting to see how the performance of the dataset is logarithmic, whilst the performance of the rmse on the probe/qualifying set is neither linear nor logarithmic. The start of the probe rmse is sort of linear, but possibly will become logarithmic after some time. At the moment, I'm at position 1120 on the leaderboard, but I've only touched the surface of this problem and I submitted those results based on a Matrix Factorization algorithm (MF) with only 16 features. As a comparison, some other people have submitted results using over 600 features, calculated over 120 epochs (runs for one feature) or so.