
Tuesday, March 31, 2009

Genetic neighbors are like k-medoids

Well, I've found some new info on the Internet whilst trying to refine my algorithms here. I'm still in the process of lining up the proper techniques to be able to run a prediction through. The next submission still won't be optimal, but it should carry a couple of good improvements.

The technique I "discovered" actually already existed. That's very common, so no problem there. The application and context in which I'm using it, though, are very different.

In k-medoids and k-means, the trick is to select a centroid or "default" that acts as the center of a determined neighborhood. In k-means, the parameters are generally averaged out, developing an idea of a class of a certain kind that you either match or not. In k-medoids, the trick is to find out which member is the best representative of the entire set (so no averaging there). This is done by computing the cost of swapping the center to another member.

The measure of similarity used here is otherwise known as the "silhouette" in k-medoids. The silhouette indicates how similar a certain data point is to its default (in my case, the gene) compared to the alternatives: expressed with similarities rather than distances, s(x) = (a(x) - b(x)) / max(a(x), b(x)), where a(x) is the similarity of the data point to the currently allocated gene and b(x) is the similarity of the data point to the next-in-line most similar cluster. Thus, if there is a very good match with A, but a not-so-good match with any other cluster, the silhouette tends towards 1. If the other cluster is better (doesn't happen in my case), it tends towards -1. A value of 0 means indifference between A and B.
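As a rough illustration (not the exact code I'm running), the score for a single user could be computed like this, following the similarity convention above:

def silhouette(a_x, b_x):
    """Silhouette-style score using similarities:
    a_x = similarity of the user to the currently allocated gene,
    b_x = similarity of the user to the next-most-similar gene.
    Tends to +1 when the allocated gene fits much better, towards -1
    when the other gene fits better, and 0 when indifferent."""
    denom = max(a_x, b_x)
    if denom == 0:
        return 0.0
    return (a_x - b_x) / denom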

Now, the trick is to initialize the clusters to some values. I'm still considering how to do this better, but I'm also considering that it may not matter too much in the beginning; it's an optimization. Then, set the ratings for each gene to the exact ratings from the training data of the "default" member that was selected. This probably works better when you select defaults that have a high vote count.
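A minimal sketch of that initialization, assuming ratings live in plain dictionaries (the names and structures here are made up for illustration):

def init_genes(ratings_by_user, k):
    """Initialize k genes by copying the ratings of the k users with the
    highest vote counts. ratings_by_user: dict user_id -> dict movie_id -> rating.
    Returns a list of dicts movie_id -> float (the gene 'averages')."""
    defaults = sorted(ratings_by_user,
                      key=lambda u: len(ratings_by_user[u]),
                      reverse=True)[:k]
    return [dict(ratings_by_user[u]) for u in defaults]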

Then, run an algorithm that determines similarity. I'm using ((1/rmse)/sumRmse) as a dirty trick for now; it seems to work well. Then record the similarity of each customer to each cluster (so we are classifying customers here for now). Movies can likely also be clustered in similar ways, and in that case one could determine biases between user and movie clusters, or even use a different matrix factorization module to run SVD between those clusters as a kind of "global average". Possibly the MF values for each user / movie become smaller in the second layer and hopefully, through the help of clustering, noise is reduced and the MF values for each individual user become more meaningful and less of an average.
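A sketch of that (1/rmse)/sumRmse trick, again with hypothetical data structures (dicts of movie -> rating) rather than the actual implementation:

import math

def similarities(user_ratings, genes, eps=1e-6):
    """Similarity of one user to each gene: 1/rmse over the commonly rated
    movies, normalized so the values sum to 1 (the (1/rmse)/sumRmse trick).
    user_ratings: dict movie_id -> rating; genes: list of dicts movie_id -> float."""
    inv = []
    for gene in genes:
        common = [m for m in user_ratings if m in gene]
        if not common:
            inv.append(0.0)
            continue
        mse = sum((user_ratings[m] - gene[m]) ** 2 for m in common) / len(common)
        inv.append(1.0 / (math.sqrt(mse) + eps))
    total = sum(inv) or 1.0
    return [v / total for v in inv]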

What I'm actually doing after allocating users into clusters / genes is using that information to complete the matrix. Not user X <-> movie Y as one big matrix, but gene X <-> movie Y. If the users really are more similar, then the error in the movie average hopefully goes down, provided the clusters are properly populated. The movie average is then used as a focal point for further calculations.

You obviously can't use "average" values any longer, because that's already set by the global movie average, which is unbiased and computed, you could say, over a single gene: the entire population. Thus, we need a new, external algorithm that can provide those numbers. Global effects would do for now, or perhaps a matrix factorization method.

The "silhouette" calculated earlier for each user then becomes useful. If the silhouette is a measure for how "powerful" a certain user falls into a gene, then we can set a threshold which users actually contribute to that gene and which don't. If you imagine a borderline between two clusters and a lot of users on that "front", then if one doesn't filter those out, those users would pull the clusters closer together, causing frequent swaps from one cluster to the other, making it less stable (I suppose, this isn't lab tested :).

In initial tests, I came up with 10 clusters and that worked well. Calculation time then is about 20 secs or so per cycle. After 10 cycles, there's just noise generating differences, but no apparent useful changes in rmse. This was before my thresholding algorithm.

After thresholding, the rmse drops quite quickly and also stays rather fixed after about 8 cycles. This may be good or bad, I don't know. I probably need some further tests to see what happens after 64 cycles... Does it then become worse? better?

The genes are calculated with a rather discrete formula; there is no learning involved there. If there is actual training data available for a gene, that is used for averaging, but only if the threshold for that user in the gene is high enough (it's a measure of how representative that user is of that gene). Choosing an incorrect threshold hurts: too high and it kills your changes and flexibility, too low and there's instability and continuous change.

The gene contains floats, these floats are "averages", and there are as many floats as there are movies. If any real ratings were found for a movie in the gene, those are used. If no ratings were found for a movie, then the customers within that cluster (those with a high enough threshold) are used to make predictions, and those predictions are averaged.
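A sketch of how one gene's floats could be recomputed each cycle; predict() here stands in for whatever external model is available (global effects, matrix factorization) and is a made-up name, not part of the actual implementation:

def update_gene(gene_members, ratings_by_user, predict, movie_ids):
    """Recompute one gene's per-movie floats from its contributing members.
    If real ratings exist for a movie among the members, average those;
    otherwise average the members' predicted ratings for that movie."""
    gene = {}
    for movie in movie_ids:
        real = [ratings_by_user[u][movie]
                for u in gene_members if movie in ratings_by_user[u]]
        if real:
            gene[movie] = sum(real) / len(real)
        elif gene_members:
            gene[movie] = sum(predict(u, movie) for u in gene_members) / len(gene_members)
    return gene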

Using this method and selecting k=24 or so, I see the rmse going down to 0.957677 or so after 10 cycles, but the improvements quickly level off after that. Increasing the number of clusters significantly increases computation time: k=10 is reasonable, k=32 is already pretty gruesome to wait for.

There is a possibility that the initialization of the clusters is horribly inaccurate and incomplete, so it's possible it needs a couple of training cycles to get to a state it is happy with. I'm not yet sure about the movement and progress of the numbers in this case.

I'm now trying to get better results by combining this method with matrix factorization within specific groups and between groups. Thinking about it: when splitting up a very large data set into groups that are more similar to one another than to the global set, it makes sense to investigate a more hierarchical ordering of movies and customers and to calculate between those groups as well. However, care must be taken not to make the groups too small or too large. Too large and the group isn't distinct enough; too small and it doesn't generate the momentum that the numbers need to point in the right direction :).

Before, I tried to apply the similarity rating to a certain cluster directly in the calculations, but it is woefully inaccurate. It's not really any use taking 20% of one cluster and 80% of another; the results are mediocre. What is much more telling, however, is the silhouette rating. That rating is much more useful for a variety of purposes, since it is a measure of how deeply allocated a user is in a certain gene.

Oh... and today my new memory arrived. Super duper 4GB to run multiple instances next to one another. Although I don't have a quad yet, so it can only effectively be 2. And beyond what I mentioned in this post, there are a multitude of other improvements I am planning before my next submission.

Thursday, October 30, 2008

Gene expressions as the process for building programs

A gene expression is the process which eventually leads to the production of a complicated protein molecule. Each protein looks slightly different and has a different role overall in the human body. The encoding of the release of proteins, when and where, is contained in the genes. Basically, from the DNA, transcripts are created (RNAs), which can be viewed as part of the blueprint in reverse form, and from these transcripts the proteins are developed in combination with other processes (regulators). Eventually, the protein is assembled and then starts executing its 'designed' function. Some biologists are now working on reverse-engineering this process (thus, reverse-engineering the construction of biological processes, as you could call it) back to the programming as it is contained in the DNA.

To call DNA the 'building blocks' of life is thus a bit of a misnomer. It's a very large blueprint, or rather, an information store. I then think of proteins as agents, constructed through the process of translating instructions (their purpose) from the RNA transcript. Whereas DNA is just a big information store, the proteins actively carry out the duties laid out in the instruction set of DNA. These duties can vary significantly. Some proteins help in cell construction; others act as information carriers, carrying messages from one organ or part of the body to another, where they meet other proteins (called receptors), causing a biochemical response in that cell, which in turn causes another biochemical reaction which can change our behaviour.

The timing of the construction of certain cells (body development) is contained in the DNA. The DNA will ensure that certain parts of the blueprint are released at the desired time to ensure correct development and functioning. It's difficult not to be in awe of the entire design of life, and of how the relatively simple function of one element, in combination with other not-so-complex functions, eventually leads to emergent intelligent behaviour, or rather, a biologically balanced system.

One of the challenges in biology is to discover where a certain protein, having a certain function, was effectively coded in the DNA. Thus: what did the information look like in the DNA structure which caused a certain protein to have its shape, size and function? Reverse-engineering that process will eventually lead to a much greater understanding of DNA itself. At the moment, this reverse-engineering is mostly done by comparing the DNA strands of individuals that have slightly different features, and then guessing where those differences are 'kept' in the blueprint. Although this is useful, it only gives indications of what the sequence should be to produce that particular feature; it cannot yet be used to induce a feature that is different from both features observed.

The situation for computer programs using gene expressions, however, is even more challenging. There is no DNA yet for programs from which programs can be written, and I really doubt whether it should lead to programs 'as we know them' (that is, a single DNA feature leading to one specific 'rule' or bytecode).

Imagine an execution environment in which a neural network can be executed. If the DNA contains instructions to form neurons and synapses, then the resulting network is going to be radically different from any NN design we know today. If proteins govern the construction of the network and its size, then the execution environment itself can monitor available memory and take appropriate steps to regulate the proteins + network in such a way that it gives the best accuracy and yield (function?), thus becoming part of the natural selection algorithm itself.
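To make that a bit more concrete, here is a toy sketch (entirely hypothetical, not a real design) in which a genome of 'synapse genes' is expressed into a network while the environment enforces a resource limit, analogous to regulating growth against available memory:

import random

def grow_network(genome, max_neurons=64):
    """Toy 'gene expression' of a network: each gene is (src, dst, weight).
    The environment refuses further growth once the neuron budget is hit."""
    neurons = set()
    synapses = []
    for src, dst, weight in genome:
        if len(neurons | {src, dst}) > max_neurons:
            break  # resource limit reached, stop expressing genes
        neurons.update((src, dst))
        synapses.append((src, dst, weight))
    return neurons, synapses

# a random genome of 10 candidate synapses between 8 possible neurons
genome = [(random.randrange(8), random.randrange(8), random.uniform(-1, 1))
          for _ in range(10)]
print(grow_network(genome))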

The problem always remains the construction of code, or 'function'. The function contained in a neural network will generally be constrained by the pre-programming of the environment itself. That is, the execution environment will be programmed to carry out certain functions, but the execution environment itself cannot 'self-innovate' and evolve new functions over time. In other words, the functions that a program could ever develop are those which are emergent from the simple functions defined in the execution environment.

Nevertheless, can such an environment with only a couple of pre-programmed capabilities lead to new insights and interesting scientific results?

Let's draw the following analogy: "nature" is the execution environment of the world around us. We are complex biological life-forms which rely on this 'execution environment' to act. In this sense, 'nature' is an abstract form, all around us, not represented in concrete form. Our biological processes allow us to perceive events from the material elements around us (be they other humans, cars, houses, etc.). We can see the representation, hear it, touch it or otherwise interact with it.

Similarly, in the execution environment in the computer, we can give a program developed by a gene expression a "world" or a "material representation". It'll be entirely abstract as in bits and bytes, but that doesn't necessarily matter.

We believe the world itself is real, because we experience it as consistent and always there. But if you've seen "The Matrix" (which btw I don't believe is real :), then you can easily understand my point that the experience of something, or being consciousness of something, doesn't necessarily mean that it has to be real 'as we know it'.

Back to the point: if the program doesn't know any better, it'll just be aware of its own world. That world consists of the inputs of the mouse, the microphone, the internet, the keyboard and so on. The outputs available to it are the video card (thus the screen), the speakers and, again, the internet. Following that pattern, the program could theoretically interface with us directly, reacting to outside real-world inputs, but always indirectly through a proxy, and also providing feedback indirectly. It's as if the program always wears VR goggles and doesn't know any better, yet we can see the effects of its reasoning on-screen or through other outputs.
  • Enormous simplification of nature (biological processes) == execution environment
  • Material objects == "modified" input/output hardware channels
Of course... one needs to start with the design for the execution environment in the first place :).

Wednesday, October 22, 2008

Genetic programming

The main philosophy behind the previous article was that Genetic Algorithms do not modify the structure of a computer program, but only the contents that the program uses. A program in this case is a specific design for a neural network and other things.

The same article hinted at the assumption that we're inclined to think in states, not in dynamics, and that we're able to reason "perfectly" using explicitly defined states with clear boundaries and attributes. The idea of evolving a program's structure, as in the previous post, has already been suggested before, but not researched to a great extent. The interesting line of thought in those articles is that the program itself is something which evolves, not just the parameters that control it.

Possibly it's just a philosophical discussion on where to draw the boundaries. The computer itself as hardware doesn't evolve from nothingness, and the next thing that engineers will claim is that it should be built 'organically' in order to come up with more suitable 'organs' (hardware elements and devices) better suited to their task.

So having said that, is there any use in placing this boundary closer to the hardware? We'll need to define a correct boundary, representing the division between the individual or organism and its surroundings. In this comparison, the program becomes the evolutionary organism and the hardware is the environment. The hardware then becomes responsible for evaluating the program as it starts to move about the space. Programs misbehaving, or inept in their capacity to perform their intended function, should be removed from the space and a new program taken from the population.

One complexity here is that nature itself is analog and doesn't necessarily prevent or prohibit invalid combinations; that is, its design is very permissive. Since a computer has been designed by humans based on explicitly defined theories and models, it is not very difficult for the individual to reach an invalid state, thereby halting that individual's progress (it dies) or, in worse cases, halting the environment altogether, requiring a reset. The latter, in analogy with the world around us, would mean that a specific mutation in a specific individual on this earth might lead to the immediate termination of us all and a "restart" of planet earth.

So regarding the computer as a suitable environment for running evolutionary programs is a bit far-fetched, given what has so far been required of a computer: to behave consistently and explicitly, according to the rules that govern the execution of a certain program or operating system.

Another problem is that certain hardware is already in place and has been designed according to these explicit hardware designs (not organically grown). For example, a computer keyboard is attached to a computer and requires a device driver to read information from that device. Thus, a keyboard is input to an individual, but it's an organ which has already been grown and needs to be attached to the program with all the limitations its design may have. On the other hand, we could also regard the keyboard as some element in nature, for example light or auditory information, the signals of which need to be interpreted by the program in order to process them in the expected way.

Because the computer is not so permissive, it may be difficult to converge on a driver or program which starts to approximate that behaviour. Only a small set of instructions in the gene could lead to a valid program. In comparison, in nature it is more likely that the organism wouldn't be invalid, just that it would have features that are not as advantageous to it (unless the end result is the same... invalid == "dead for sure"?).

As complexity progresses, small mutations should eventually converge on an organism which is better suited to deal with its environment, due to natural selection. Since a computer is so explicit about well-behaved programs, and any invalid instruction anywhere might kill the program, there is reason to think about a design which incorporates better knowledge of the environment in which the program operates. For example, insert the program in between specific inputs/outputs and let it loose within those constraints, rather than allowing it to evolve naturally between all kinds of input/output ports, hopefully evolving into something useful.

Thus, the biggest challenge here is to find a specific, suitable grammar which can be used to form the program itself, and a way for the elements of that grammar to be represented by lines of genetic instructions, such that any manipulation of the genetic instructions produces valid processing instructions, never invalid ones. My preference definitely goes out to runtime environments for handling this kind of thing, both because the complexity of dealing with the hardware itself is greatly reduced and because the program runs in a more controlled environment.
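One existing approach along these lines is grammatical evolution, where the genome is just a list of integers and each integer merely selects a production from a grammar, so every genome decodes into a syntactically valid program. A toy sketch (the grammar and names here are made up for illustration):

GRAMMAR = {
    "expr":  [["expr", "op", "expr"], ["var"], ["const"]],
    "op":    [["+"], ["-"], ["*"]],
    "var":   [["x"]],
    "const": [["1"], ["2"], ["3"]],
}

def decode(genome, symbol="expr", depth=0, max_depth=6):
    """Map a list of integer codons to an expression: each codon picks a
    production (codon % number_of_rules), so decoding can never produce
    an invalid program."""
    rules = GRAMMAR.get(symbol)
    if rules is None:
        return symbol, genome                       # terminal symbol
    if depth >= max_depth:                          # stop runaway recursion
        rules = [r for r in rules if r != ["expr", "op", "expr"]] or rules
    codon, genome = (genome[0], genome[1:]) if genome else (0, [])
    parts = []
    for sym in rules[codon % len(rules)]:
        piece, genome = decode(genome, sym, depth + 1, max_depth)
        parts.append(piece)
    return " ".join(parts), genome

expr, _ = decode([7, 2, 5, 0, 1, 3, 4, 8, 6])
print(expr)                      # always a syntactically valid arithmetic expression
print(eval(expr, {"x": 5}))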

Another question is how that grammar should be compiled. Graph theory? Can we express symbolic reasoning in a way that comes closer to the design of the computer, but which is not as explicit as a rule-based system?

It'd be cool to consider a program running in a JVM, which receives a stream of bits from the keyboard and is then evaluated on its ability to send the appropriate letter to an output stream, which directs it to the screen. The challenge here is to find a correct method of representing the 'construction instructions', and to decide in what language and in what manner these should be translated into valid code, able to run on a picky CPU.
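As a rough idea of how that evaluation might look (in Python here rather than on the JVM, with purely illustrative names), a fitness function could score a candidate decoder on turning 8-bit 'keyboard' chunks into the expected letters:

def fitness(candidate, bitstream, expected):
    """Score a candidate decoder: feed it 8-bit chunks from the bitstream
    and count how many of its output characters match the expected text.
    candidate: callable taking a tuple of 8 bits and returning one character."""
    chunks = [tuple(bitstream[i:i + 8]) for i in range(0, len(bitstream), 8)]
    correct = sum(1 for chunk, want in zip(chunks, expected) if candidate(chunk) == want)
    return correct / max(len(expected), 1)

# reference: the ASCII bits for "hi", and a hand-written perfect decoder
text = "hi"
bits = [int(b) for ch in text for b in format(ord(ch), "08b")]
perfect = lambda chunk: chr(int("".join(map(str, chunk)), 2))
print(fitness(perfect, bits, text))   # 1.0 for the perfect decoder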