## Tuesday, April 21, 2009

### Digging RBM's

Okay, so here's some little information about Restricted Boltzmann Machines as it applies to the Netflix prize. I haven't got it working perfectly yet, but getting close. The paper may be a little bit challenging to start off with, but once you get the objective right, things are essentially pretty easy. I'm referring to the paper "Restricted Boltzmann Machines for Collaborative Filtering". The picture here shows the gist of the technique of producing predictions using this method. A user can be represented by its vector. The user vector is basically per column the rating of a movie. If a movie was not rated, then that column is not used in the calculation. By using all the movies that the user rated, the 'hidden part' of the network is loaded into a certain state. This state can then be used to reproduce figures on the visible state of the network, where the missing movie ratings are. And yes, to calculate user/movie ratings, it's simply the act of calculating this rating based on the hidden state in the network. This is done by weight multiplication with the active features in the hidden part, generating the softmax units for that missing movie rating. Hopefully it should approximate what the user really rated.

So is it a neural network? Not really. This Boltzmann machine has the ability to fantasize about what something should look like, based on parts of its observation. So you can use it for various purposes, from pattern recognition to completing parts of a picture or pictures, as long as there are recurring patterns (you can't ask it to reproduce something it's never seen before). Well, at least not yet :).

The way how this thing is trained can get pretty complicated. But enter "Contrastive Divergence". This method allows the machine to be trained pretty quickly. Imagine that by the vectors at the lower part, you're multiplying each individual softmax (5 softmaxes per movie) by its individual softmax weight and then add the bias from each hidden unit to that. Each rated movie in the user vector will either contribute or diminish the activation of the hidden unit. This is the positive activation phase. By the end of this phase, the hidden layer has F units, where x are activated ( 1.0f ) and y are not activated( 0.0f ). Yes, that is thus a binary activation pattern in the simple case (not Gaussian). Sampling in this paper means:

if ( hidden_unit_value > random_value_between_0.0_and_1.0 ) {
hidden_unit_value = 1.0f;
} else {
hidden_unit_value = 0.0f;
}

As soon as the hidden layer is calculated, we now reverse the process. What if the hidden layer was already trained? Then by having the state in the hidden layer as it is now, we should be able to reproduce the visible layer. But if the hidden layer has not yet been trained, we'll soon see that there's a certain error that'll occur. And if we re-do the positive phase again after that, we'll also see differences in the hidden layer activation.

Now, here comes the beatdown of the contrastive divergence algorithm:
1. If a hidden feature and a softmax unit are on together, add 1 in the box for that observation (visible_unit_i_softmax_p_feature_j). Thus, stores the frequency that both units were on together. You probably need a 3-dimensional matrix to store this information. And this makes sense, because you also have a weight per movie per softmax per feature. And we want to train those. Let's call this matrix CDpos.
2. After performing the negative phase and the positive phase again, repeat this process. We store those numbers in other structures suitable for this. Thus, we now have the frequency that softmaxes and features were on together in the first positive phase and the softmaxes and hidden features that were on after one negative and one positive (but you can actually do this a number of n times, as described in the paper, after learning has progressed). The number of epochs for learning is small, the paper mentions 50. Let's call this matrix CDneg.
3. The learning phase basically comprises subtracting CDneg from CDpos, then updating the weight through a learning rate. Thus:

W(i=movie,j=feature,p=softmax) += lrate * (CDpos - CDneg);

4. You can make this fancier by including momentum and decay as mentioned in the paper:

CDinc = (MOMENTUM*CDinc)+(LRATE*cd)-(DECAY*W);
W += CDinc;

UPDATE 30/04 | This should be:

CDinc = (MOMENTUM*CDinc)+LRATE * (cd - (DECAY*W));

5. The trick in the negative phase is also if you want to sample the visible vector reconstruction or not. Or do this only in the training phase and more of those decisions. I'm sampling always in the training phases, but only the hidden layer in the prediction phase.
6. In the prediction phase, I'm normalizing the softmax probabilities, then add them multiplied by their factor, then divide by 5.0f. You could also take the highest probability and then guess on that number. I chose my method, because I think it's likely more accurate in the future. It's got x probability to be a 1, y for a 2, and so forth. Thus, it's dealing with probabilities and if there's a strong pressure for a 5.0, it'll be a five. Otherwise somewhere between 4 and 5.
The rest of the paper are further elaborations on this model. Gaussian hidden features and conditional RBM's. Conditional RBM's basically allow the machine to also learn from missing ratings (so, rather than training just on what you have, you also train on what you don't have. Brilliant!). It also allows to use the information in the qualifying and probe set. So, the machine will know they were rated, but doesn't know to what. That is information and a great thing to add to the model.

Hope that helps!