
Thursday, August 13, 2009

Confabulation theory

Confabulation theory is a theory from Robert Hecht-Nielsen about the cognitive function of the brain, or in other words the workings of thought. It's a theory, not an explanation. Confabulation theory basically works by processing lots of information and then finding out, from this information, which symbols belong together. Which symbols are often seen together (and in some cases at which distance) is the information contained within the network. Confabulation can then produce new sentences by continuously generating possible phrases based on the context seen prior to that point.
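
To make the co-occurrence idea concrete, here is a toy sketch of my own (a simple bigram counter, not Hecht-Nielsen's actual architecture): count how often word pairs appear together in a corpus, then extend a phrase with the continuation that co-occurs most strongly with the current word.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy sketch of the co-occurrence idea: count adjacent word pairs in a
// corpus, then continue a phrase with the most frequent co-occurring word.
int main() {
    std::vector<std::string> corpus = {
        "the", "train", "arrives", "at", "the", "station"};

    // cooc[a][b]: how often word b directly follows word a.
    std::map<std::string, std::map<std::string, int>> cooc;
    for (size_t i = 0; i + 1 < corpus.size(); ++i)
        cooc[corpus[i]][corpus[i + 1]]++;

    // Generation: from a seed word, repeatedly take the strongest continuation.
    std::string word = "the";
    std::cout << word;
    for (int step = 0; step < 4 && cooc.count(word); ++step) {
        const auto& next = cooc.at(word);
        word = std::max_element(next.begin(), next.end(),
                   [](const auto& a, const auto& b) { return a.second < b.second; })
                   ->first;
        std::cout << " " << word;
    }
    std::cout << "\n";
}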



In effect, confabulation theory uses an architecture that can produce entirely new sentences which are plausible in context. The interesting thing is that these sentences are also grammatically and syntactically correct. Thus, the rules of a language seem to be embedded within this network.

This does not mean that the machine 'understands' the language, or that it is conscious of the sentences it produces. I think the results should be considered the output of a thought-less brain, one that only has the capacity to produce sentences without understanding what they mean. It's probably comparable to the occupant of the Chinese Room, looking at streams of Chinese symbols. At some point, the machine knows the order in which the symbols may appear and which symbols are often seen together. When asked to produce a sentence on its own, it uses this knowledge.

What interested me in the theory is how this network differs from other networks that can fantasize, like the RBM. The RBM is a network that stores knowledge by looking at things and can then complete the signal it is receiving. The confabulation network is slightly different, in the sense that it can project continuations (say, a hypothesis) that are very plausible. So, if you were building a network that can produce responses to sentences, the confabulation network is likely to perform better. But if you ask a confabulation network to recognize a face, it might have considerable difficulty, and the RBM might be better.

The RBM is a flatter network (judging the entire system) in comparison to the confabulation network. The confabulation network just lets symbols and modules compete, always taking the highest value of all, but since the signals proceed from those results towards other modules, it is in a sense hierarchical.

It'd be really interesting to identify the specific properties of each network and then see if they can be used together. It's also possible that we're thinking about this the wrong way. The continuous processing of the confabulation network is quite different from the others. We like to think from one static situation to the next; perhaps the entire thing is more dynamic than that, and we should focus on generating states, looping back and reprocessing the results, thus continuously adding more results to some hypothesis.

Since A.I. is also a lot about search in large search spaces (think chess!), a neural network could be used to generate a hypothesis step by step, until it is deemed that a particular branch isn't going to produce a good result, so that the branch can be terminated.
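
A minimal sketch of that pruning idea, with a stand-in scoring function where the text imagines a network judging each partial hypothesis:

#include <algorithm>
#include <functional>
#include <iostream>
#include <string>

// Generate hypotheses step by step; a branch whose partial score drops
// below a threshold is terminated instead of expanded further.
void search(const std::string& hypothesis, int depth,
            const std::function<double(const std::string&)>& score) {
    if (score(hypothesis) < 0.5) return;     // prune: branch won't pay off
    if (depth == 0) {                        // complete hypothesis survived
        std::cout << hypothesis << "\n";
        return;
    }
    for (char move : {'a', 'b'})             // candidate next steps
        search(hypothesis + move, depth - 1, score);
}

int main() {
    // Stand-in scorer; a trained network would play this role.
    auto score = [](const std::string& h) {
        if (h.empty()) return 1.0;
        return std::count(h.begin(), h.end(), 'a') / double(h.size());
    };
    search("", 3, score);
}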

Tuesday, May 12, 2009

Why RBMs are so strangely weird

I'm getting quite obsessed by RBMs, for some strange reason. There's a very strange simplicity to the RBM: a very elegant method for learning through contrastive divergence, and a strange ability to model many things. Current research shows that RBMs certainly have limitations, but here we go, trying to expand on that.

An RBM is a very strange kind of neural network. Artificial neural networks as we know them generally work the signal in a forward direction, but RBMs work in a forward and a backward direction. In a sense, you could say that it's a little bit similar to our minds: when we observe something, we use the details from the input signal to enrich the actual observations, but at the same time use information from our experience to enrich, or anticipate, what is being observed. I reckon that if we were to rely only on the observed state, that state wouldn't be nearly as rich as our mentally induced state, which blends our experience with our observations.
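
A minimal sketch of that bidirectional flow (dimensions, weights and names are mine; biases are omitted for brevity): the same weight matrix is used to go up from visible to hidden and back down from hidden to visible, and the downward pass is what lets the network fill in what it expects to see.

#include <cmath>
#include <vector>

// Logistic activation, used in both directions.
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Forward pass: visible -> hidden. W[j][i] connects hidden unit j
// to visible unit i.
std::vector<double> up(const std::vector<std::vector<double>>& W,
                       const std::vector<double>& visible) {
    std::vector<double> hidden(W.size());
    for (size_t j = 0; j < W.size(); ++j) {
        double sum = 0.0;
        for (size_t i = 0; i < visible.size(); ++i)
            sum += W[j][i] * visible[i];
        hidden[j] = sigmoid(sum);
    }
    return hidden;
}

// Backward pass: hidden -> visible, reusing the same weights. This is
// the direction in which the network "fantasizes" what it expects to see.
std::vector<double> down(const std::vector<std::vector<double>>& W,
                         const std::vector<double>& hidden) {
    std::vector<double> visible(W.empty() ? 0 : W[0].size());
    for (size_t i = 0; i < visible.size(); ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < hidden.size(); ++j)
            sum += W[j][i] * hidden[j];
        visible[i] = sigmoid(sum);
    }
    return visible;
}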

Here's a video that might blow your mind, or not... It's a presentation by Giulio Tononi, which I found very compelling. In this theory, the search is not for the quantity of neurons required to become conscious, or for the localization of consciousness within the brain; it's more a theory of the most effective organization of neurons within a network for such a network to exhibit consciousness. (text)

Here's where I paid huge attention. Apparently, having a network in which all neurons are connected together is total crap. And a network that is very large and has a high number of local connections can be good at something, but it doesn't have the right properties for consciousness. The best thing is a network of specialized neurons, connected in patches, with long-range connections now and then to other parts of the entire mesh. Much of the work there is related to quantifying consciousness. If this quantification is in step with actual consciousness, one can continue to search for more effective methods of building neural nets or machines.

The property of "patchiness" suggests that blindly connecting neurons together isn't the most effective way to build a network. A highly regular network makes the whole system act like a general on/off machine, losing its specificity of function. Neurons that are not connected enough make it work like a number of independent classifiers, which isn't good either.

Most NNs and RBMs build their theories around having some number of neurons or elements connected evenly with other layers, and then calculate a kind of "weight" from one element to another. Putting more neurons into a certain layer generally makes the network more effective, but the improvement is asymptotic.

I wonder whether it's possible to develop a theory, complementary to the theory of the quantity of consciousness, which perhaps as some derivative allows a neural network to shape itself, or whether such theories provide better rules for constructing networks. One good guess would be to observe the biological growth and connection-shaping of a brain, or of simpler parts, and then assess the patterns that evolve in the generation of such a network.

Finally, the most interesting words of the hypothesis:

Implications of the hypothesis

The theory entails that consciousness is a fundamental quantity, that it is graded, that it is present in infants and animals, and that it should be possible to build conscious artifacts.

This is a huge implication, and in order to understand it, one should go back to the start of this post. Consciousness == experiencing things. As said before, it means that our observations carry detail, which is processed by itself, but which is also completed by previous experiences. Thereby, our actual experience is not just the observations we make, but the total sum of those observations plus memories, evoked emotions, etc. In a way, you could say that what we observe causes us to feel aroused, or to have some kind of feelings, and seeing similar things again at a later point in time might cause us to see the actual observations plus the previous experiences (memory) at the same time. It's very likely that not all experiences are actually consciously lived, in the sense that we're aware of all the experiences we could have; very likely there are many experiences just below the surface of consciousness, as some kind of potential or stochastic possibility, waiting to be activated by changes in the temporal context.

For example, rapid changes in our direct observations can cause instant changes in behaviour. This implies that next to observing the world like a thought-less camera, consuming light rays and audio waves, we're also experiencing the world as a kind of stochastic possibility. The easiest way of demonstrating this is the idea of movement, of intent, of impact and likely effect.

The phrase "I'm standing at a train station and I see a train coming towards me" contains huge amounts of information: the recognition of the train in the first place, the experience that it's moving towards you because the train is becoming larger, the knowledge that the train runs over tracks that you're standing next to, the knowledge that train stations are places where trains stop, and your intent to get on the train. Just spelling out how much knowledge we apply to such a simple situation demonstrates how we accept our consciousness as the most normal thing on earth, which it certainly is not.

Well, so why are RBMs so strange in this sense? Because old-school neural networks don't have these properties. RBMs can both recognize things and fantasize them back. There are certainly limitations for now. In previous posts about consciousness I've argued that we perhaps shouldn't limit the definition to "consciousness == when humans think or experience". When maintaining a broader definition of consciousness, one can also consider machines or A.I.'s which are extremely effective in a very particular area of functioning and might just be conscious in that relevant area without having any kind of consciousness of the things around them. The definition of consciousness here is a dangerous one, however, since it shouldn't be confused with behaviour, which it certainly is not.

Food for thought...

Tuesday, April 21, 2009

Digging RBMs

Okay, so here's a little information about Restricted Boltzmann Machines as applied to the Netflix prize. I haven't got it working perfectly yet, but I'm getting close. The paper may be a little bit challenging to start off with, but once you get the objective right, things are essentially pretty easy. I'm referring to the paper "Restricted Boltzmann Machines for Collaborative Filtering". The picture here shows the gist of the technique for producing predictions with this method. A user is represented by a vector that holds, per column, the rating of a movie. If a movie was not rated, that column is not used in the calculation. By feeding in all the movies that the user rated, the 'hidden part' of the network is loaded into a certain state. This state can then be used to reproduce values on the visible part of the network, where the missing movie ratings are. And yes, calculating a user/movie rating is simply the act of computing that rating from the hidden state in the network: a weight multiplication with the active features in the hidden part generates the softmax units for the missing movie rating, which should hopefully approximate what the user would really have rated.
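
Condensed into code, the prediction step looks roughly like this. This is my own sketch: the weight layout W[movie][rating][feature], the variable names, and the use of a plain expectation over the normalized softmax are illustrative choices, and visible biases are omitted.

#include <cmath>
#include <utility>
#include <vector>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// W[m][k][j]: weight between softmax unit k (rating k+1) of movie m
// and hidden feature j. Layout and names are illustrative.
using Weights = std::vector<std::vector<std::vector<double>>>;

// Predict a rating for targetMovie from the user's known (movie, rating) pairs.
double predict(const Weights& W, const std::vector<double>& hiddenBias,
               const std::vector<std::pair<int, int>>& rated, int targetMovie) {
    size_t F = hiddenBias.size();

    // Positive phase: load the hidden state from every movie the user rated.
    std::vector<double> h(F);
    for (size_t j = 0; j < F; ++j) {
        double sum = hiddenBias[j];
        for (const auto& mr : rated)                 // mr = (movie, rating 1..5)
            sum += W[mr.first][mr.second - 1][j];
        h[j] = sigmoid(sum);
    }

    // Downward pass: reconstruct the 5 softmax units of the missing movie.
    double e[5], total = 0.0;
    for (int k = 0; k < 5; ++k) {
        double sum = 0.0;
        for (size_t j = 0; j < F; ++j)
            sum += W[targetMovie][k][j] * h[j];
        e[k] = std::exp(sum);
        total += e[k];
    }

    // Expected rating under the normalized softmax distribution.
    double expected = 0.0;
    for (int k = 0; k < 5; ++k)
        expected += (k + 1) * (e[k] / total);
    return expected;
}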


So is it a neural network? Not really. This Boltzmann machine has the ability to fantasize about what something should look like, based on parts of its observation. So you can use it for various purposes, from pattern recognition to completing parts of a picture, as long as there are recurring patterns (you can't ask it to reproduce something it has never seen before). Well, at least not yet :).

The way this thing is trained can get pretty complicated. But enter "Contrastive Divergence". This method allows the machine to be trained pretty quickly. Imagine that, for the vector at the lower part, you multiply each individual softmax unit (5 softmaxes per movie) by its individual softmax weight and then add the bias of each hidden unit. Each rated movie in the user vector will either contribute to or diminish the activation of the hidden unit. This is the positive activation phase. By the end of this phase, the hidden layer has F units, of which x are activated (1.0f) and y are not activated (0.0f). Yes, that is thus a binary activation pattern in the simple case (not Gaussian). Sampling in this paper means:

// Bernoulli sampling: compare the hidden unit's activation probability
// against a uniform random draw to obtain a binary state.
if ( hidden_unit_value > random_value_between_0.0_and_1.0 ) {
    hidden_unit_value = 1.0f;
} else {
    hidden_unit_value = 0.0f;
}

As soon as the hidden layer is calculated, we reverse the process. What if the hidden layer were already trained? Then, with the state in the hidden layer as it is now, we should be able to reproduce the visible layer. But if the hidden layer has not yet been trained, we'll soon see that a certain error occurs. And if we re-do the positive phase after that, we'll also see differences in the hidden layer activation.

Now, here comes the breakdown of the contrastive divergence algorithm:
  1. If a hidden feature and a softmax unit are on together, add 1 in the box for that observation (visible_unit_i_softmax_p_feature_j). This stores the frequency with which both units were on together. You probably need a 3-dimensional matrix to store this information, and that makes sense, because you also have a weight per movie per softmax per feature, and those are what we want to train. Let's call this matrix CDpos.
  2. After performing the negative phase and then the positive phase again, repeat this counting and store the numbers in another structure suitable for this. Thus, we now have the frequency with which softmaxes and features were on together in the first positive phase, and the frequency with which they were on after one negative and one positive phase (you can actually do this n times, as described in the paper, once learning has progressed). The number of epochs for learning is small; the paper mentions 50. Let's call this matrix CDneg.
  3. The learning phase basically comprises subtracting CDneg from CDpos, then updating the weight through a learning rate. Thus:

    W(i=movie,j=feature,p=softmax) += lrate * (CDpos - CDneg);

  4. You can make this fancier by including momentum and decay, as mentioned in the paper (see the full update sketch after this list):

    CDinc = (MOMENTUM*CDinc)+(LRATE*cd)-(DECAY*W);
    W += CDinc;


    UPDATE 30/04 | This should be:

    CDinc = (MOMENTUM*CDinc)+LRATE * (cd - (DECAY*W));

  5. One trick in the negative phase is deciding whether or not to sample the visible vector reconstruction, or to do so only during training, and more of those decisions. I always sample in the training phase, but sample only the hidden layer in the prediction phase.
  6. In the prediction phase, I normalize the softmax probabilities, then add them multiplied by their rating factor, then divide by 5.0f. You could also take the highest probability and guess that number. I chose my method because I think it's likely more accurate in the long run: there is probability x of a 1, y of a 2, and so forth. Thus, it deals in probabilities, and if there's strong pressure for a 5.0, it'll be a five; otherwise somewhere between 4 and 5.
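
Putting steps 3 and 4 together, the per-weight update (with the corrected formula from the 30/04 note) would look roughly like this in code; the function and variable names are mine:

// One contrastive divergence update for a single weight, where
// cd = CDpos - CDneg for that (movie, softmax, feature) cell and
// Winc carries the running momentum term between epochs.
void updateWeight(double& W, double& Winc, double cd,
                  double LRATE, double MOMENTUM, double DECAY) {
    Winc = MOMENTUM * Winc + LRATE * (cd - DECAY * W);
    W += Winc;
}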
The rest of the paper consists of further elaborations on this model: Gaussian hidden features and conditional RBMs. Conditional RBMs basically allow the machine to also learn from missing ratings (so, rather than training just on what you have, you also train on what you don't have. Brilliant!). This also allows using the information in the qualifying and probe sets: the machine knows those movies were rated, but doesn't know what the rating was. That is information, and a great thing to add to the model.
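
As I understand the conditional extension, a binary vector r marks every movie the user rated at all, including qualifying/probe entries whose rating value is unknown, and a second weight matrix D feeds that vector into the hidden units. A rough sketch, with layouts and names of my own choosing:

#include <cmath>
#include <utility>
#include <vector>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Conditional RBM hidden activation, as I read the paper: r[m] is 1 for
// every movie the user rated at all (even if the rating value is unknown,
// as in the qualifying set), and D[m][j] carries that extra influence.
std::vector<double> hiddenProbs(
    const std::vector<std::vector<std::vector<double>>>& W,   // W[m][k][j]
    const std::vector<std::vector<double>>& D,                // D[m][j]
    const std::vector<double>& hiddenBias,
    const std::vector<std::pair<int, int>>& rated,            // (movie, rating)
    const std::vector<int>& r) {                              // r[m] in {0,1}
    std::vector<double> h(hiddenBias.size());
    for (size_t j = 0; j < h.size(); ++j) {
        double sum = hiddenBias[j];
        for (const auto& mr : rated)           // known ratings, as before
            sum += W[mr.first][mr.second - 1][j];
        for (size_t m = 0; m < r.size(); ++m)  // rated-at-all indicator
            if (r[m]) sum += D[m][j];
        h[j] = sigmoid(sum);
    }
    return h;
}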

Hope that helps!

Monday, April 13, 2009

RBMs, Prolog and pictures

I'm just now looking into RBMs in an effort to apply them to the Netflix prize. I'm getting reasonable results with other methods now. My rank is 435 at the moment, and pretty soon I should be under 400. RBMs are a bit more difficult to implement than I imagined: loads of factors and intricate mathematical details. Other than that, I'm not throwing away my other methods. I'd like to see how RBMs can be applied to train on residuals, otherwise known as "those hard-to-rate movies".

There is at least one error that won't go away, and that is the fact that a user can simply decide to rate a movie off by one rating point. If that happens, the error probably ranges from 0.5 to 1.4, creating a large difference. There's a very large group of users that's pretty predictable, but most groups are quite unpredictable in their behaviour (or most ratings are).

Well, on another note: I'm now being tortured with Prolog. I really like the language, but hate it at the same time, since I'm so used to procedural programming. I keep looking for for-loops, list iterations, inserts, deletes and the like, but Prolog doesn't truly have them. There is a bit of procedurality in Prolog, but it's mostly declarative. The tricky part is that it has a couple of idioms you need to get used to. The starting phase especially is difficult, but having used the tricks here and there, my thinking inside the language is developing a bit.

Oh, and I made some new pictures with the camera:

http://www.flickr.com/photos/radialmind/sets/72157616654463308/