So is it a neural network? Not really. This Boltzmann machine has the ability to fantasize about what something should look like based on parts of its observation. So you can use it for various purposes, from pattern recognition to completing missing parts of a picture, as long as there are recurring patterns (you can't ask it to reproduce something it's never seen before). Well, at least not yet :).
The way this thing is trained can get pretty complicated, but enter "Contrastive Divergence". This method allows the machine to be trained pretty quickly. Imagine that, for the user vectors at the visible (lower) layer, you multiply each individual softmax unit (5 softmaxes per movie) by its individual softmax weight for a given hidden unit, sum those products, and then add that hidden unit's bias. Each rated movie in the user vector will either contribute to or diminish the activation of the hidden unit. This is the positive activation phase. By the end of this phase, the hidden layer has F units, of which x are activated (1.0f) and y are not activated (0.0f). That is thus a binary activation pattern in the simple case (not Gaussian). Sampling in this paper means:
/* hidden_unit_value holds the activation probability before sampling */
float rnd = (float)rand() / (float)RAND_MAX;   /* uniform in [0.0, 1.0] */
if ( hidden_unit_value > rnd ) {
    hidden_unit_value = 1.0f;
} else {
    hidden_unit_value = 0.0f;
}
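To make the positive phase more concrete, here's a minimal sketch in C. All names, array shapes, and constants below are my own assumptions for illustration (F hidden features, NUM_MOVIES movies, ratings 1..5), not the paper's notation:

#include <math.h>
#include <stdlib.h>

#define F          100      /* number of hidden features, illustrative */
#define NUM_MOVIES 17770    /* e.g. the Netflix movie count, illustrative */

/* Positive phase for one user (assumed names):
 *   rated_movie[r] - index of the r-th movie this user rated
 *   rating[r]      - its rating 1..5, i.e. which of the 5 softmax units is on
 *   W[i][j][p]     - weight between movie i, feature j, softmax unit p
 *   hidden_bias[j] - bias of hidden feature j
 * The result is a sampled binary hidden layer in hidden[]. */
void positive_phase(int num_rated, const int *rated_movie, const int *rating,
                    float W[NUM_MOVIES][F][5], const float hidden_bias[F],
                    float hidden[F])
{
    for (int j = 0; j < F; j++) {
        float sum = hidden_bias[j];
        for (int r = 0; r < num_rated; r++) {
            int i = rated_movie[r];
            int p = rating[r] - 1;   /* the softmax unit that is 1.0f */
            sum += W[i][j][p];       /* the other four softmaxes are 0.0f */
        }
        float prob = 1.0f / (1.0f + expf(-sum));      /* sigmoid activation */
        float rnd  = (float)rand() / (float)RAND_MAX; /* uniform in [0,1] */
        hidden[j] = (prob > rnd) ? 1.0f : 0.0f;       /* sample as above */
    }
}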
As soon as the hidden layer is calculated, we reverse the process. What if the hidden layer was already trained? Then, with the hidden layer state as it is now, we should be able to reproduce the visible layer. But if the hidden layer has not yet been trained, we'll soon see that a certain reconstruction error occurs. And if we then re-do the positive phase, we'll also see differences in the hidden layer activation.
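That reconstruction (the negative phase) looks roughly like the sketch below, reusing the assumed defines and includes from the previous snippet, plus an assumed visible_bias[i][p] for the softmax biases. For each movie the user rated, compute its 5 softmax activations from the sampled hidden layer and normalize them:

/* Negative phase for one rated movie i: reconstruct its 5 softmax
 * probabilities from the sampled hidden layer. */
void reconstruct_movie(int i, const float hidden[F],
                       float W[NUM_MOVIES][F][5],
                       const float visible_bias[NUM_MOVIES][5],
                       float softmax_prob[5])
{
    float total = 0.0f;
    for (int p = 0; p < 5; p++) {
        float sum = visible_bias[i][p];
        for (int j = 0; j < F; j++)
            sum += hidden[j] * W[i][j][p];   /* only active features add weight */
        softmax_prob[p] = expf(sum);
        total += softmax_prob[p];
    }
    for (int p = 0; p < 5; p++)
        softmax_prob[p] /= total;            /* normalize into a softmax */
}

From these probabilities you can sample a reconstructed rating (or keep the probabilities), which is the sampling decision discussed further down.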
Now, here comes the breakdown of the contrastive divergence algorithm:
- If a hidden feature and a softmax unit are on together, add 1 to the counter for that observation (visible_unit_i_softmax_p_feature_j). This counter thus stores the frequency that both units were on together. You probably need a 3-dimensional matrix to store this information, and that makes sense, because you also have a weight per movie per softmax per feature, and those are what we want to train. Let's call this matrix CDpos.
- After performing the negative phase and then the positive phase again, repeat this counting and store the numbers in a second structure of the same shape. Thus, we now have the frequency that softmaxes and features were on together in the first positive phase, and the frequency that they were on together after one negative and one positive phase (you can actually run that chain n times, as described in the paper, once learning has progressed). The number of epochs for learning is small; the paper mentions 50. Let's call this second matrix CDneg.
- The learning phase basically comprises subtracting CDneg from CDpos and then updating the weight through a learning rate (a sketch follows after this list). Thus:
W(i=movie, j=feature, p=softmax) += LRATE * (CDpos - CDneg);
- You can make this fancier by including momentum and weight decay, as mentioned in the paper:
CDinc = (MOMENTUM * CDinc) + (LRATE * cd) - (DECAY * W);
W += CDinc;
UPDATE 30/04 | This should be:
CDinc = (MOMENTUM * CDinc) + LRATE * (cd - (DECAY * W));
- Another decision in the negative phase is whether or not to sample the visible vector reconstruction, or whether to do that only during training, and more choices like that. I always sample in the training phase, but sample only the hidden layer in the prediction phase.
- In the prediction phase, I normalize the softmax probabilities, add them up multiplied by their rating factor, and then divide by 5.0f (see the prediction sketch below). You could also take the highest probability and guess that rating. I chose my method because I think it's likely more accurate: there's an x probability of a 1, a y probability of a 2, and so forth. It deals in probabilities, so if there's strong pressure towards a 5.0, it'll be a five; otherwise it lands somewhere between 4 and 5.
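Here's the sketch referred to in the learning-phase bullet: first the CDpos/CDneg bookkeeping, then the per-weight update with momentum and decay using the corrected formula from the update above. The hyperparameter values and names are mine, not the paper's, and the arrays reuse the dimensions assumed in the earlier sketches:

#define LRATE    0.001f    /* illustrative hyperparameters, not the paper's */
#define MOMENTUM 0.9f
#define DECAY    0.0002f

/* Count co-activations for one user: when softmax unit p of rated movie i and
 * hidden feature j are on together, add 1 to CD[i][j][p]. Call this on CDpos
 * after the first positive phase (with the user's real ratings), and on CDneg
 * after the negative + second positive phase (with the sampled reconstruction
 * and the re-sampled hidden layer). */
void accumulate(float CD[NUM_MOVIES][F][5], int num_rated,
                const int *rated_movie, const int *rating,
                const float hidden[F])
{
    for (int r = 0; r < num_rated; r++) {
        int i = rated_movie[r];
        int p = rating[r] - 1;
        for (int j = 0; j < F; j++)
            if (hidden[j] == 1.0f)
                CD[i][j][p] += 1.0f;
    }
}

/* Per-weight update: cd = CDpos - CDneg, smoothed with momentum and decayed. */
void update_weight(float *W, float *CDinc, float CDpos, float CDneg)
{
    float cd = CDpos - CDneg;
    *CDinc = (MOMENTUM * (*CDinc)) + LRATE * (cd - (DECAY * (*W)));
    *W += *CDinc;
}

You would call update_weight for every (movie, feature, softmax) triple after each pass over a batch of users, zeroing CDpos and CDneg again afterwards.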
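And here's the prediction sketch mentioned in the last bullet: normalize the 5 reconstructed softmax activations, weight each by its rating value, and sum. The final division by 5.0f follows the post (it maps the prediction to roughly [0, 1]); drop it if you want a value on the 1..5 scale:

/* Predict a rating from the 5 reconstructed softmax activations of one movie. */
float predict_rating(const float softmax[5])
{
    float total = 0.0f, expected = 0.0f;
    for (int p = 0; p < 5; p++)
        total += softmax[p];
    for (int p = 0; p < 5; p++)
        expected += (softmax[p] / total) * (float)(p + 1); /* weight by rating */
    return expected / 5.0f;   /* as in the post; omit for a 1..5 prediction */
}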
Hope that helps!