Tuesday, April 21, 2009

Digging RBM's

Okay, so here's some little information about Restricted Boltzmann Machines as it applies to the Netflix prize. I haven't got it working perfectly yet, but getting close. The paper may be a little bit challenging to start off with, but once you get the objective right, things are essentially pretty easy. I'm referring to the paper "Restricted Boltzmann Machines for Collaborative Filtering". The picture here shows the gist of the technique of producing predictions using this method. A user can be represented by its vector. The user vector is basically per column the rating of a movie. If a movie was not rated, then that column is not used in the calculation. By using all the movies that the user rated, the 'hidden part' of the network is loaded into a certain state. This state can then be used to reproduce figures on the visible state of the network, where the missing movie ratings are. And yes, to calculate user/movie ratings, it's simply the act of calculating this rating based on the hidden state in the network. This is done by weight multiplication with the active features in the hidden part, generating the softmax units for that missing movie rating. Hopefully it should approximate what the user really rated.

So is it a neural network? Not really. This Boltzmann machine has the ability to fantasize about what something should look like, based on parts of its observation. So you can use it for various purposes, from pattern recognition to completing parts of a picture or pictures, as long as there are recurring patterns (you can't ask it to reproduce something it's never seen before). Well, at least not yet :).

The way how this thing is trained can get pretty complicated. But enter "Contrastive Divergence". This method allows the machine to be trained pretty quickly. Imagine that by the vectors at the lower part, you're multiplying each individual softmax (5 softmaxes per movie) by its individual softmax weight and then add the bias from each hidden unit to that. Each rated movie in the user vector will either contribute or diminish the activation of the hidden unit. This is the positive activation phase. By the end of this phase, the hidden layer has F units, where x are activated ( 1.0f ) and y are not activated( 0.0f ). Yes, that is thus a binary activation pattern in the simple case (not Gaussian). Sampling in this paper means:

if ( hidden_unit_value > random_value_between_0.0_and_1.0 ) {
hidden_unit_value = 1.0f;
} else {
hidden_unit_value = 0.0f;

As soon as the hidden layer is calculated, we now reverse the process. What if the hidden layer was already trained? Then by having the state in the hidden layer as it is now, we should be able to reproduce the visible layer. But if the hidden layer has not yet been trained, we'll soon see that there's a certain error that'll occur. And if we re-do the positive phase again after that, we'll also see differences in the hidden layer activation.

Now, here comes the beatdown of the contrastive divergence algorithm:
  1. If a hidden feature and a softmax unit are on together, add 1 in the box for that observation (visible_unit_i_softmax_p_feature_j). Thus, stores the frequency that both units were on together. You probably need a 3-dimensional matrix to store this information. And this makes sense, because you also have a weight per movie per softmax per feature. And we want to train those. Let's call this matrix CDpos.
  2. After performing the negative phase and the positive phase again, repeat this process. We store those numbers in other structures suitable for this. Thus, we now have the frequency that softmaxes and features were on together in the first positive phase and the softmaxes and hidden features that were on after one negative and one positive (but you can actually do this a number of n times, as described in the paper, after learning has progressed). The number of epochs for learning is small, the paper mentions 50. Let's call this matrix CDneg.
  3. The learning phase basically comprises subtracting CDneg from CDpos, then updating the weight through a learning rate. Thus:

    W(i=movie,j=feature,p=softmax) += lrate * (CDpos - CDneg);

  4. You can make this fancier by including momentum and decay as mentioned in the paper:

    CDinc = (MOMENTUM*CDinc)+(LRATE*cd)-(DECAY*W);
    W += CDinc;

    UPDATE 30/04 | This should be:

    CDinc = (MOMENTUM*CDinc)+LRATE * (cd - (DECAY*W));

  5. The trick in the negative phase is also if you want to sample the visible vector reconstruction or not. Or do this only in the training phase and more of those decisions. I'm sampling always in the training phases, but only the hidden layer in the prediction phase.
  6. In the prediction phase, I'm normalizing the softmax probabilities, then add them multiplied by their factor, then divide by 5.0f. You could also take the highest probability and then guess on that number. I chose my method, because I think it's likely more accurate in the future. It's got x probability to be a 1, y for a 2, and so forth. Thus, it's dealing with probabilities and if there's a strong pressure for a 5.0, it'll be a five. Otherwise somewhere between 4 and 5.
The rest of the paper are further elaborations on this model. Gaussian hidden features and conditional RBM's. Conditional RBM's basically allow the machine to also learn from missing ratings (so, rather than training just on what you have, you also train on what you don't have. Brilliant!). It also allows to use the information in the qualifying and probe set. So, the machine will know they were rated, but doesn't know to what. That is information and a great thing to add to the model.

Hope that helps!

Sunday, April 19, 2009

Temporal factors in Netflix data

Well, as with many other teams, I am now looking at modeling temporal effects in the netflix data set. I'm using some home-grown SVD system based initially on funks, but quickly added some other parameters to make the RMSE sink faster. I'm now at spot 268 with 0.8975 and just re-running my machine to get lower than that with some new additions.

Time is a bit of a hurdle. But I've got some ideas to get this right. The above is based on a model that uses sub-optimal parameter settings after playing around with some other experiments. But after the run I was too lazy to run it again in the proper way.

So... I'm also going to look at user temporal differences. Here's an example of the weirdest user in the set (with the most time on his hands). Interestingly, this user, bot or agent started out with a "proper" average of 3.4055 in 2001-2002, but then suddenly started using Netflix differently. If you look for posts of a psychologist on netflix, you'll understand what I mean. Some guys like to use 1.0 as "movie has no discernible, interesting features whatsoever" ( and blockbusters may well fall into that), such that 1.0 doesn't really mean "bad", it just means not very interesting whatsoever and I switched the tele off. That basically means the user removed the 1.0-3.0 scale for himself and only uses 1.0-5.0 to indicate better than average movies.

| count | avg_bucket | globavg | start_date | end_date |
| 402 | 3.4055 | 1.9082 | 2001-09-23 | 2002-09-23 |
| 9698 | 2.3414 | 1.9082 | 2002-09-23 | 2003-09-23 |
| 4022 | 1.3993 | 1.9082 | 2003-09-23 | 2004-09-23 |
| 3296 | 1.1271 | 1.9082 | 2004-09-23 | 2005-09-23 |
| 235 | 1.1319 | 1.9082 | 2005-09-23 | 2006-09-23 |

So, one of my experiments was actually to re-center the ratings around the user average using some epsilon function. The idea was that the figures in these sets mean different things, so rescaling it to the global semantics sounded like a good idea. Unfortunately, it didn't work out well at all. Maybe I'll get back to that idea later though.

Looking at the results above again and many others, I do see there's some kind of trend in people using the ratings differently. Here's another:

| count | avg_bucket | stddev | globavg | start_date | end_date |
| 3452 | 3.3566 | 0.6829 | 3.2761 | 2003-09-23 | 2004-09-23 |
| 1968 | 3.1443 | 0.5545 | 3.2761 | 2004-09-23 | 2005-09-23 |
| 191 | 3.1780 | 0.6862 | 3.2761 | 2005-09-23 | 2006-09-23 |

Again, a user who's slightly changing the habits of the use of the system. Probably, they think that there's a lot of 4's and 5' at some point, after which 2/3 become more prominent. This takes some time of course. And what about this one:

| count | avg_bucket | stddev | globavg | start_date | end_date |
| 238 | 5.0000 | 0.0000 | 5.0000 | 2004-09-23 | 2005-09-23 |

LOL. We can reasonably assume that people choose movies that they like to watch (so in general, their average should be higher than 3). But this is ridiculous :).

Another interesting thing. Netflix has told us that they've frobbed the data here and there for anonymization purposes. But watch this:

| count | avg_bucket | start_date | end_date | title |
| 5 | 4.6000 | 2001-06-21 | 2001-09-23 | Lord of the Rings: The Two Towers |
| 1 | 4.0000 | 2001-09-23 | 2001-12-21 | Lord of the Rings: The Two Towers |
| 4 | 4.5000 | 2002-03-21 | 2002-06-21 | Lord of the Rings: The Two Towers |
| 11 | 4.7273 | 2002-06-21 | 2002-09-23 | Lord of the Rings: The Two Towers |
| 33 | 4.8788 | 2002-09-23 | 2002-12-21 | Lord of the Rings: The Two Towers |
| 651 | 4.7496 | 2002-12-21 | 2003-03-21 | Lord of the Rings: The Two Towers |
| 441 | 4.8413 | 2003-03-21 | 2003-06-21 | Lord of the Rings: The Two Towers |
| 8342 | 4.4839 | 2003-06-21 | 2003-09-23 | Lord of the Rings: The Two Towers |
| 17065 | 4.3618 | 2003-09-23 | 2003-12-21 | Lord of the Rings: The Two Towers |
| 13057 | 4.3979 | 2003-12-21 | 2004-03-21 | Lord of the Rings: The Two Towers |
| 13027 | 4.4752 | 2004-03-21 | 2004-06-21 | Lord of the Rings: The Two Towers |
| 10855 | 4.4567 | 2004-06-21 | 2004-09-23 | Lord of the Rings: The Two Towers |
| 15302 | 4.5075 | 2004-09-23 | 2004-12-21 | Lord of the Rings: The Two Towers |
| 22276 | 4.4670 | 2004-12-21 | 2005-03-21 | Lord of the Rings: The Two Towers |
| 17546 | 4.4310 | 2005-03-21 | 2005-06-21 | Lord of the Rings: The Two Towers |
| 19335 | 4.4759 | 2005-06-21 | 2005-09-23 | Lord of the Rings: The Two Towers |
| 12574 | 4.5484 | 2005-09-23 | 2005-12-21 | Lord of the Rings: The Two Towers |
| 655 | 4.5603 | 2005-12-21 | 2006-03-21 | Lord of the Rings: The Two Towers |

Lord of the Rings, the Two Towers came out 18-12-2002 or so. Whoops. That's some 50-60 ratings (possibly more) that have dates before the production date of the movie. Thus, any team that's working with temporal effects should take that into account. The first day of rating doesn't mean much. Possibly, these dates should all be set to at least the release date of the movie first. Since the DVD's aren't immediately available on Netflix, some more time would elapse before you could see them there. Not sure how much. The DVD came out August/November 2003, but rentals probably have them sooner, somewhere in between? Let's say March or so, it's impossible to tell. In any case, the rating dates aren't really useful for temporal calculations in that case (not with any kind of precision anyway). Moreover, if you're using production year + ratings, you should thus not assume that all rating dates are after the movie production year.

Wednesday, April 15, 2009

Consciousness and RBM's

I was exploring some thoughts regarding the origins of consciousness. It's a huge subject to read about and certainly one that allows one to specialize in different ways on this subject. So far, I've read books from Steven Pinker, Roger Penrose, John Holland, general articles on the Internet on this topic, philosophy of mind and the likes.

The one thing that really struck me when I was watching this video, is how important the "fantasizing step" for consciousness is. Fantasizing == imagination == abstract thought and (abstract) manipulations of things seen before in order to construct something new out of past experiences.

So far, neural networks have been viewed from the perspective of recognition, not so much from reproduction of certain action. Also, most neural network activity is one-way. It's true that the learning process requires backwards propagation for weight adjustment, but the general execution phase is from input -> output, never backwards.

But RBM's have the property that they can both be used for recognition as well as production. The production phase is useful for things like prediction. Basically, prediction is all about recognizing patterns or important forces that strongly suggest that reaching a certain state or value is a higher probability than any other. This can be done by reasoning, or calculation or whatever. (see from 21:38 in the video mentioned above to see this).

Now, here comes the fun part. One could consider imagination (+innovation) to be:
  • Constructing coarse solutions through reasoning (needs to have A + B not C).
  • Filling in the blanks of generated course solutions.
  • Backtracking over complete blanks, defining the unknown as a subproblem to resolve prior to resolving the bigger picture.
The interesting parts of these thoughts is that it provides ways for a machine to actually construct thought itself, as long as the premise is true that inputs into the machine can be represented on a very abstract, symbolic level and the machine actually has some goals to follow. Thus, given some goals, it develops subgoals, interprets the environment and constantly redefines subgoals and so forth. There are things missing here of course, but you should think at a very abstract level of representation.

Think of the mind as a huge network of connections like a RBM with different stacks, where different types of processing occur. At the neurons near the eyes, the light gets interpreted and already it contains some sort of pre-processing like edge detection and so on. The next step is to start recognizing shapes between the edges and blotches of color. What does it all mean? I highly believe that we don't nearly store as much detail as we think we do for visual processing. And 100 billion neurons isn't really that much when you think about the amount of information we're really storing, especially when parts of these neurons are contributed to specific tasks like speech production, visual recognition, speech recognition, audible signals recognition, pre-frontal cortex processing (high-level abstract thought), emotional supression / understanding, and so forth.

Now, with consciousness... what if what we're seeing really is the induction of a high-level abstract thought in the pre-frontal cortex towards the lower hierarchical layers in this huge network? Then consciousness is more or less like reliving past experiences in sound, vision, emotion and the likes. It still raises questions on where this induction starts (ghost in the machine), but this may also be explained by (random?) the inverse of the operation, namely the occurrence of a certain emotion, the observation of a visual appearance, the hearing of some sound or the smell of something.

Now, especially olfactory memory, the latter one, is interesting. By smelling freshly cut grass or other specific smells, we sometimes immediately relive experiences from our youth. This is not something that is consciously driven for example, but as of yet totally happens. This is relived not just by smelling the smell, it's a visual, audible and emotional thing as well. The interesting part in this that we seem able to steer our thoughts (concentrate) on certain parts. Steering away from certain thoughts is much more difficult (don't think of a black cat! oops, I asked you not to think of one! :).

So... here goes... can thought be seen as a loop between abstract ideas and reproductions of those abstract ideas made from previous experiences? Someone not having many experiences won't have a lot of imagination in that sense and others with loads of experiences and knowledge may be able to infer more about certain qualities and consider a range of other possibilities and options (experience).

And the other question... can this be used as a stepping stone to model thought in machines? and if possible, could we call this consciousness?

Monday, April 13, 2009

RBM's, Prolog and pictures

I'm just now looking into RBM's in an effort to apply this to the Netflix prize. I'm getting reasonable results with other methods now. My place is 435 at the moment and pretty soon, I should be under 400. RBM's are a bit more difficult to implement than I imagined, loads of factors and intricate mathematical details. Other than that, I'm not throwing away my other methods. I'd like to see how RBM's can be applied to train on residuals, otherwise known as "those hard to rate movies".

There is at least one error that won't go away and that is the fact that a user can simply decide to rate a movie off by one rating point. If that happens, the error probably ranges from 0.5 to 1.4, thus creating a large difference. There's a very large group that's pretty predictable, but most groups are quite unpredictable in their behaviour (or most ratings are).

Well, on another note. I'm now being tortured with prolog. I really like the language, but hate it at the same time, since I'm so much used to procedural programming. I keep looking for for-loops, list iterations, inserts, deletes and the likes, but prolog doesn't truly have them. There is a bit of procedurality in Prolog however, but it's mostly declarative. The bad thing is that it has a couple of tricks that you need to get used to. Especially the starting phase is tricky, but having used the tricks here and there, the thinking inside the language is developing a bit.

Oh, and I made some new pictures with the camera:


Thursday, April 02, 2009

SVD hierarchies disappointing...

I had some good expectations of putting part of the problem into an SVD hierarchy in an attempt to home in on the right ratings sooner by grouping similar users together in what I called genes and then running SVD over that, followed by another detailed SVD run afterwards that would then bring my rmse under 0.9 or so.

Images make things a lot easier to discuss. The image here shows genes, think of them as groups of similar users. The ratings of each user determines the gene that they fall in by calculation of their silhouette and similarity (standard rmse function). There are likely other methods of doing this.

Now, the trick here is that each gene is incomplete, just like the matrix, but because the users are more similar, if you pick up all the users within the gene and then interpolate the rest of its own structure using an external prediction function, the gene becomes complete and becomes more able to classify other users. At least that was the general idea. In practice it doesn't really lead to incredible results.

The problem lies in the density of the information. SVD doesn't need compression or approximation, it likes a lot of information. I reckon that SVD works a lot better once you have a more or less square matrix. 17,700 by 480,000 is a relatively square matrix in SVD perspectives. But the above isn't 480,000, it become 20 x 17,700. And therein lies the rub. The information is so much condensed across the users, that it's not very good for calculating individual numbers for SVD.

It did start me thinking on problems of asymmetry. I was considering that if there are 20 columns versus 17,700 rows, then the problem is that the columns are gaining way too much credit for each learning run (assuming the lrate and K for both are equal). That is why I think square matrices are more suited for SVD. So I also tested out some inequal numbers, just to see what would happen. I wanted the movie factors to gain more information than the gene factors, because for each rating that I'd process (100,000,000), the movie factors would get hit about 5,000 times vs. 5,000,000 for the genes. That's a factor 1, 000. So I tried setting up the learning rate and regularization factor accordingly. Movies would learn 1,000 times faster than genes, still unstable. Possibly, it should both have a very slow learning rate to be successful.

Most of these numbers are empirically determined throug experimentation and have been suggested by others as most successful. Trying slightly different numbers for movies/users doesn't really help much, the overall end result is the same or worse. So best to stick with what you have. Of course, the regularization factor K is related to the choice of learning rate, and I think it's also related to the size that the factors get. In the previous examples, I noticed that the regularization factors are about 8 times the learning rate for the svd factors, but times 50 for the bias learning rate.

So it's more or less back to the drawing table. Basically, I've noticed that SVD works much better if you use a flat starting rate, for example 3.6033 or 3.6043. Then work from there. If you initialize to the movie average, it doesn't work that well. It does make sense in a way, because using an absolute base to work from creates a stable point of reference for the entire matrix.

The best thing to do is generate my new results, open up my error analyzer and find out what else to do :). It is quite possible that there simply isn't a better solution for a single / combined algorithm other than blending 5,000 results together just to get higher on the leaderboard :).