Monday, December 21, 2009

The brain in a vat

Daniel C. Dennett opens "Consciousness Explained" with a description of a demon who tries to trick a brain in a vat into thinking it is actually inside its real body and world, having real worldly experiences. You can easily compare this account to the hooked-up people inside the Matrix, where each individual's brain is fooled into having a body, relatives, material things and the day-to-day worries of their lives, while actually being hooked up to a large computer, the Matrix, which conjures up these illusions for every brain. Dennett gives this a bit more thought and considers what you would need to do for a brain to be tricked like this. Now, let us assume that, contrary to the story of the Matrix, the people in this thought experiment actually have had real-world experiences to compare these sensations to. First of all, you would need to be able to simulate the senses: vision, smell, hearing, taste and so forth. And here comes the difficult part. The demon or the computer also needs to detect the brain's chosen actions and react to them the way the physical world would feed back on such actions, in a way that this brain is used to.

In the example, he refers, for instance, to lying on the beach with your eyes closed, running your fingers through the sand and the feeling that the coarse grains of sand give your fingertips. You could also think of the action of jumping, the feeling of being in the air for only a little while and the thump when you land back on your feet. Or whistling in some echo-ey tunnel made mostly of metal and the sound this returns to your ears, or the joint coordination of some sports game, etc...

So, the difficulty of tricking a brain is in the first place hooking it up in the right places, sending it the right kinds of signals and reading the brain at the right places. A more difficult thing, however, is that outside the brain there's a physical world that the brain expects to get feedback from in very specific ways. Especially once it has experience of this world, anything that changes will seem strange. The illusion will wear off very quickly because of these little differences (although you could argue that if the illusion is near-perfect, the brain will probably start doubting itself instead?).

The point here is that the whole field of "Artificial Intelligence" isn't as close as some people may try to make you think. Computers live in some kind of "model" of reality; it's a transduction of what is really out there to be perceived. Some other transduction very likely affects us as well (and trying to tell someone else what you experience, or what a bat experiences, is very difficult to achieve. We simply cannot easily imagine how it would be to experience something else entirely). This model is an electronic or otherwise suitable representation of the world around a robot or AI, and therefore subject to our imagination and experience, but not necessarily the most ideal one.

Worse yet, the biological entities around us evolved in the physical world, presumably along the lines of Darwin's Origin of Species, where evolution is largely a function of natural selection of creatures that constantly (need to) adapt to their environment. The same ideas can be found in Genetic Algorithms, an engineering method where algorithms, or parameters thereof, are encoded as genetic material and then mutated and crossed over just like biological material. This is sometimes useful in cases where the actual (very long!) functions are very difficult to discover directly.
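As a minimal illustration of that idea, here is a toy genetic algorithm in Python; the two-parameter genome, the fitness function and all the constants are invented purely for the example.

import random

# Toy fitness: how close a two-parameter genome gets to a hidden optimum.
# In practice the fitness would come from the actual engineering problem.
def fitness(genome):
    x, y = genome
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)

def crossover(a, b):
    # Single-point crossover on a two-gene genome: one gene from each parent.
    return [a[0], b[1]]

def mutate(genome, rate=0.1):
    return [g + random.gauss(0, 1) if random.random() < rate else g for g in genome]

population = [[random.uniform(-10, 10), random.uniform(-10, 10)] for _ in range(50)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                      # selection: keep the fittest
    children = []
    while len(children) < len(population) - len(parents):
        a, b = random.sample(parents, 2)
        children.append(mutate(crossover(a, b)))   # recombination + mutation
    population = parents + children

population.sort(key=fitness, reverse=True)
print "best genome:", population[0], "fitness:", fitness(population[0])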

In supervised learning, where the desired output is known exactly, artificial neural networks can be taught such patterns by feeding in the input, checking the outcome and recalibrating the parameters of the network until the performance of the network no longer improves. There are difficulties in this kind of training: a network that is too large overfits the training data and generalizes badly to unseen cases that it may have to predict, while a network that is too small is generally not capable of holding the information needed to capture the actual function, and so underperforms.
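The same over/underfitting trade-off is easy to demonstrate in a simpler setting than a neural network. The sketch below uses polynomial fitting with numpy as a stand-in (the data and degrees are made up); the polynomial degree plays the role of network size.

import numpy as np

np.random.seed(0)
# Noisy samples of an underlying quadratic; the "true function" stands in
# for the unknown pattern a network would have to learn.
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + np.random.normal(0, 0.2, x.shape)
x_test = np.linspace(-1, 1, 200)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test ** 2

for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, degree)     # "train" on the noisy samples
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    # Typically degree 1 underfits, degree 9 overfits the noise, degree 2 does best.
    print degree, rmse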

So, for neural networks, one either ends up remodelling the entire world and calculating elements of it so that a network can be trained to recognize or predict them, in which case you're better off sticking to that representation of the world instead. The alternative is not pretty either: one must find a method to evolve a network such that it has the correct architecture for the job at hand, and train it without knowing the elements in advance. More so because these more complex networks generally don't have classification outputs like most ANNs do, but instead, I think, probably thrive on the state within the network itself, induced by their neural surroundings.

Monday, December 14, 2009

Combinatorial explosions

Next to "Consciousness Explained", I'm reading a book by one of my favourite writers, Steven Pinker. This one is called "The Blank Slate", the blank slate otherwise known as the tabula rasa: the idea of the brain as a kind of blackboard that is written to by experiences, as opposed to certain capabilities and intricacies being preprogrammed, otherwise called innate. Pinker's writing is everything from an exploration to various explanations, and not least a (strong) argument about how we should interpret this "blank slate" in the search for an explanation of intelligence and consciousness. It's probably best to see his presentation over at TED talks. Be aware, though, that the presentation doesn't cover the topic of the book entirely and that there are many more little facts and explorations that it certainly doesn't touch. Other than that, it provides a good insight into his kind of thinking and his scientific approaches and validations.

The combinatorial explosion comes into effect when the number of variables increases by such an amount that a true understanding of every state, and of how states influence each other, becomes an impossible task. Of the examples given, one is the case of the human genome, which was analyzed and in some reports counted at 30,000 genes. The total DNA is quite a bit more than that: these genes make up only about 1.5% of it, and the rest is considered "junk DNA" (apparently and allegedly). Looking up the numbers in different sources, you get different figures here as well. Some sources say 35,000, others 30,000 and Wikipedia claims 23,000. Which one is the true number, we'll never know :).

There's a kind of worm, though (the nematode C. elegans), that has roughly 18,000 genes (or 20,000 by other counts?). This worm has exactly 302 neurons, avoids certain smells and crawls around looking for food. How come such a worm with 18,000 genes is so strikingly different in behaviour and intelligence from us humans, when we have "only" 35,000 genes? Are we so much like worms?

If you read these numbers linearly, then this must indeed be shocking. But genes interact among themselves and may inhibit or activate other genes that create different proteins, which in turn influences the way an organism develops and grows, and at which time-scale in life.

Here we go... If you count every possible on/off combination of 18,000 interacting genes, you get 2^18,000 patterns, which is already staggering; with 35,000 genes that becomes 2^35,000, a factor 2^17,000 larger. Whoops! That has the potential to be substantially more complex than anything we've ever seen before :). This doesn't mean that all these genes do interact, but it shows that the amount of information cannot easily be captured by just listing the number of genes. The actual information is contextually determined by the genes that really do interact together, and by whether there are also combinations of 3, 4 or 5 genes that form some kind of composition together. In those cases the complexities go up to 3^x or even 4^x (before, we assumed simple pairwise on/off interactions).
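As a quick back-of-the-envelope check of that factor, counting each possible on/off subset of genes as one interaction pattern:

# Each subset of genes is one possible interaction pattern, so there are 2^n of them.
genes_worm, genes_human = 18000, 35000
ratio_exponent = genes_human - genes_worm   # 2^35000 / 2^18000 = 2^17000
print "factor is 2^%d, a number with about %d decimal digits" % \
    (ratio_exponent, len(str(2 ** ratio_exponent)))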

Now, it is interesting to find out how these genes interact and how much influence they really have on our thinking and behaviour. We generally consider the environment a very important element in the analysis of behaviour and upbringing, sometimes to such an extent that the entire evaluation is attributed to how some environment determined a person's actions and behaviour. But if we find out that general behaviour, personality twists, general motives and so on aren't necessarily determined by the environment but are hardcoded in genes, and it's just our experience that allows us to exhibit this behaviour or not, then the picture changes severely. By the way, I'm starting here with my own thoughts; this is not necessarily what was written in the book.

In that case, experience is more of a method to determine probabilities, elements of chance and other things that either inhibit our motives or stimulate them. In this view, our personality and the things we do are genetically determined, while our dynamic interaction and behaviour choices are mostly governed by experiences from the environment. The general observation one can draw from this is that personality twists, likes and dislikes probably come out as characteristic features of a person, but they are genetic, whereas specific choices not to do something, or to go for it all the way, are potentially given by variables in the environment. This sheds a totally new light on behaviour and how we perceive it in general.

Another thing I found interesting is the way it interlinks with computer science books about modularity and compositions of specific system parts with a particular purpose. Rather than thinking of a computer as a single thing, you can divide it into multiple elements like the CPU, the harddrive, the graphics card, etc. But if you look closer, the CPU is a large number of transistors with a number of pins plugging into the motherboard somewhere, such that it is linked with the memory and buses on the motherboard that give it the power it has. The CPU is often called the central part and other parts are ancillary to it (well, you might also argue that the motherboard is the main part, because that's what everything is slotted into).

In this way, you can look at the entire computer from totally different views, each explaining very different purposes and levels of abstraction. Looking at the transistors of the CPU, there is no point in discussing why a word processor does all the things it is told to do; the level of detail is too fine to consider that question. A more appropriate level is to consider the functions of the computer as a whole and then to explain how people interact with it, why a computer reacts and acts the way it does (it's been told to do that by its designers) and so on. There is also the level of how devices interact, which is interesting. To handle keystrokes, for example, you could consider yourself part of this system, the input provider. The keyboard is a transducer that converts a key press into a scancode, which is read from the USB port of the computer. This scancode, a byte, is then processed by the CPU and handed to the OS and the program, which determine whether the scancode should be discarded or accepted. The program may then decide to append the character to some array of bytes it has in memory, completing a long line of character strings. For feedback into this entire cycle, the graphics card gets a pointer into this array and repaints the screen when needed.... PHEW!

For each of these things, you can go down to the signal level even, but also further than that on the physics level of electrons. Explaining this process through electrons is going to be a long sit-in, so let's not do that here. At the highest level, it just seems to make sense. You press a key on the keyboard and that makes the character appear there on screen where the cursor is... Is that so hard to understand? :).

Similarly, in the understanding of our thought processes, there surely seems to be good room for finding out how networks interact and process or store information. I don't think the science of neural networks can be called complete, in the sense that we know everything about them and what they do. One idea, for example, is that neural networks are very good at taking one signal and outputting another, basically responding directly to signals. But the amateurish way we compose artificial neural networks doesn't yet provide handholds for doing more with them. Biological networks may have a lot of "failover cells" in them that are not strictly necessary to make something function. Also, the human brain consists of roughly 100 billion neurons, but by some accounts the majority of those are needed for direct, muscular responses, instinct and movement (the cerebellum and the "reptilian brain"). That leaves "only" a minority of cells for human reasoning, visual perception, auditory perception, speech and other functions. Hmmm... that does shed a different light on things.

Basically, numbers by themselves don't give you information about actual complexity. It's the interactions between the components, what they can do together in unison, that yields the most power.

Tuesday, November 17, 2009

Sane Value Decomposition

Oh no! Another article about SVD! Well, this one is not specifically about SVD, but it is describing some of my theories about parametrization of SVD and why so many people get different results and possibly why I didn't win! :).

To explain my theories, we need to first understand SVD in some reasonable way. The SVD formula goes:
A = U Ʃ V^T

which means that the above (for the purposes of this explanation) square matrix A is decomposed into three equally sized matrices U, Ʃ and V. U and V are orthogonal matrices; Ʃ is a diagonal matrix containing the so-called singular values.

More intuitively said, the U matrix describes the rows of the original matrix, whereas the V matrix describes its columns. Take row i of U and row j of V (that is, column j of the transposed matrix V^T), multiply them element by element, weight each product by the corresponding singular value on the diagonal of Ʃ and sum the results: that gives you back cell (i, j) of the original matrix A. So SVD of a matrix A gives an exact representation of the values in A, just in a different, 'decomposed' way.

But we don't have to keep all dimensions around, because keeping all three full matrices means we need three times more storage than the original, and that's not the point. So the idea is to reduce the dimensions of each matrix and keep only the first x around. x should be a reasonable percentage of the rank (the row/column size), though, because if x is relatively too low, you're not getting good results. I will show how this all works in more detail using Python, a language that's very quickly becoming my favourite for experiments!

Example:

This is a square matrix with random values between 1-5 of n x m = 5 x 5 dimensions. So 25 numbers in total.
[ [3,5,2,4,4],
[3,2,5,5,3],
[4,4,5,4,3],
[2,3,4,5,2],
[1,5,5,2,2] ]
One thing I discovered (which many sites probably mention in not so readable terms :) is that when the matrix becomes relatively large, the first singular value approaches the average cell value multiplied by the square root of the total number of cells in the matrix (in the case above, the average is 3.48).
Ʃ(1) = avg * sqrt( number_of_cells )
The second singular value for larger, irregular matrices is approximated by:
Ʃ(2) = sqrt( 2 * Ʃ(1) )
So there certainly is a pattern being generated. Each singular value in the decomposition is actually a gain factor. So, if the matrix is really, really large, the singular values become in the orders of 1000's and the values in the first columns of U, V become 1/1000 in the case of this particular range for the Netflix data ( 1-5 across the matrix ).
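As a rough sanity check of that first approximation, here's a numpy snippet on a larger random 1-5 matrix (the size is arbitrary); the first singular value should come out close to avg * sqrt(number_of_cells):

import numpy as np

np.random.seed(1)
A = np.random.randint(1, 6, (500, 500))      # random ratings in 1..5
sigma = np.linalg.svd(A, compute_uv=False)   # singular values only
approx = A.mean() * np.sqrt(A.size)          # avg * sqrt(number_of_cells)
print sigma[0], approx                       # the two should be close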

Here are the actual singular values when the matrix is decomposed:
sigma = mat( [ 17.76224954, 3.5139895, 3.16968094, 1.60407331, 0.73105457 ] )
And here are U and V:
mw = mat( [[-0.44379232, -0.33273609 ,-0.80072792 , 0.11314317 ,-0.19587877], \
[-0.4616133 , 0.58960592 , 0.15588499 ,-0.08007446 ,-0.63919166], \
[-0.5053614 , 0.05135939 , 0.03629005 ,-0.69454331 , 0.50819749], \
[-0.41956887 , 0.28391073 , 0.09075436 , 0.69668699 , 0.49974748], \
[-0.39816248 ,-0.67705869 , 0.57007135 , 0.11412068 ,-0.21225762] ] )

cw = mat( [[-0.33638547 , 0.24667399 ,-0.32741104 ,-0.73031105 , 0.43063272], \
[-0.47365356 ,-0.80039862 ,-0.13379567 , 0.1795791 , 0.29131499], \
[-0.52873629 , 0.082443 , 0.81168849 ,-0.18044758 ,-0.14913599], \
[-0.50662787 , 0.5372694 ,-0.21592459 , 0.61450025 , 0.17445861], \
[-0.3553354 ,-0.05530591, -0.4116298 ,-0.15564457 ,-0.82280841] ] )
Notice how in the U and V matrix, the numbers are maintained at roughly the same magnitude throughout each dimension (you need to read rows and each column further to the right is one dimension deeper). This can be done because if you follow the implementation of SVD correctly, the singular value is the gain factor. So, choosing the correct 'gain' is essential for proper functioning of the model, or you need to adjust the parameters in each dimension.

What some people did for Netflix, for example, was to assume a gain of 1.0 (the singular values were not used) and then initialize each dimension to some 'low' value. Naturally, these 'low' values really are low in the beginning, but once the model starts to hit other dimensions, you may suddenly hit a gain factor that lies above the singular value of the actual matrix Ʃ. At that point the model immediately becomes unstable and starts producing very high values.

Let's look at a regular matrix:
[[4 4 4 4 4]
[5 5 5 5 5]
[3 3 3 3 3]
[2 2 2 2 2]
[1 1 1 1 1]]

U:
[[-0.54446573 -0.21442291 0.76382282 -0.25982498 0.08152026]
[-0.65208362 -0.46955287 -0.55122106 0.17692309 0.13842191]
[-0.3619848 0.49755586 0.15572615 0.71006801 -0.30489008]
[-0.31846257 0.36154464 -0.26944015 -0.60360543 -0.57526478]
[-0.21422565 0.59604242 -0.12602146 -0.1807017 0.74182633]]
sigma:
[ 17.71717246 1.04811144 0. 0. 0. ]
VT:
[[-0.53913501 0.50275905 0.61561055 -0.13109677 0.24577241]
[-0.40915723 -0.26153467 0.04808419 0.872681 -0.01748586]
[-0.35699995 -0.56822929 -0.1744093 -0.31507809 0.64805378]
[-0.37793419 -0.44513203 0.23939904 -0.33777929 -0.69829541]
[-0.52119151 0.39724791 -0.72868447 -0.08872685 -0.17804491]]
Whoops, what happened? We now only have two dimensions left; from the third dimension onwards all values are 0. This is because the matrix is so regular that it can be described perfectly with two factors. Hmmm.... So if there is good regularity in the data, we need fewer numbers to represent the matrix! Of course, the above is taken to the extreme, but looking at recommendation systems, there certainly are regularities:
  • Some users are very constant in their ratings and generally use a scale of 4-5 for everything.
  • Nobody wants to rent bad items, those are generally avoided, so the likelihood that someone gets an unexpectedly bad item is relatively low. That means that relatively speaking, ones and twos occur less often than 3-5.
In essence, you also don't get a lot of information if rows or columns contain only 5's. The information is actually in the irregularities (or in temporal trends, but that's another story! :).

More on the theory... Suppose we stick with the irregular 5x5 matrix above; it's basically random. I wanted to find out how fast I can converge to this 5x5 matrix using both some clever initialization techniques and gradient descent. (Gradient descent is where you assume some random factors mw/cw, calculate your prediction using those factors, compare this with the real number and adjust the factors by the error you get from the comparison. This way, you can "train" the decomposed matrices U and V individually without having to store the entire matrix A in memory, which you don't have anyway, except for a small percentage of it, say 1%.)

When you look at the matrix, the first feature / component that you calculate needs to make a jump from 0 to somewhere in the 1-5 region, and all the numbers 1-5 are positive. This is good, because you then know that the two factors of each feature must be either both positive or both negative. Make your choice. A good approximation for each feature further on is the sqrt( average ) of the respective column and row. This is because in the final calculation, the value in the cell is hopefully already quite well approximated by:
Aij = Ri1 * Cj1 * Ʃ1
In practice, for slightly larger matrices and those that are totally random, the root mean square error of each estimation vs. the real value will be around 1.41 or so. Of course, this is a function of the range used in the actual matrix A. It also shows that if the matrix is highly irregular, you're not converging very quickly to an error rate where your model becomes useful for making nice approximations.
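To make the gradient descent concrete, here's a minimal sketch that trains a single feature f on a list of known (row, column, value) cells; the function name, learning rate and epoch count are just placeholders:

def train_feature(ratings, R, C, f, lrate=0.01, epochs=100):
    # ratings: list of known cells as (i, j, value); R and C hold the row and column features.
    for _ in range(epochs):
        for i, j, value in ratings:
            # Prediction using the features trained so far (0 .. f).
            predicted = sum(R[i][k] * C[j][k] for k in range(f + 1))
            err = value - predicted
            r_old = R[i][f]
            R[i][f] += lrate * err * C[j][f]   # nudge only the feature being trained
            C[j][f] += lrate * err * r_old
    return R, C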

Learning matrices is a tricky operation. Even for the matrix from above, which has full information available, is square and is relatively easy to learn, the model becomes unstable if the learning rate is too high, and the RMSE at some point increases exponentially. Ideally, you start from a good initial point, so that the gradient descent only needs to settle into the minimum, but this is only possible in the start situation. A good initialization of a model uses the following formulas:
  1. Calculate GLOB_AVG from entire matrix
  2. Ʃ1 = GLOB_AVG * sqrt( num_cells )
  3. Ri1 = sqrt( Ravg / Ʃ1 )
  4. Cj1 = sqrt( Cavg / Ʃ1 )
  5. lrate_first_cycle == slow (it's almost there) == 1 / Ʃ1
  6. INIT = ( GLOB_AVG / ( 2 * sqrt( 2 * GLOB_AVG * sqrt( num_cells ) ) ) )
  7. Rix (x>1) = rand( -INIT, INIT )
  8. Cjx (x>1) = rand( -INIT, INIT )
  9. lrate_other_cycles == depends on the data :)
I made the latter comment because I don't yet have sufficient empirical insight into how the learning rate should be calculated. It also depends on the objective. If you have full data available and you just want to decompose quickly, you might go for fast, high learning rates. For others, accuracy is more important, as is not having too many factors lying around; then a fast enough, but much slower learning rate is more appropriate.
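In code, the initialization steps above could look roughly like this (a sketch; row_avgs and col_avgs are the per-row and per-column averages of the known cells, and the names are mine):

from math import sqrt
import random

def init_model(glob_avg, row_avgs, col_avgs, num_cells, num_features):
    sigma1 = glob_avg * sqrt(num_cells)                            # step 2
    init = glob_avg / (2 * sqrt(2 * glob_avg * sqrt(num_cells)))   # step 6
    R = [[sqrt(ravg / sigma1)] +                                   # step 3: first feature
         [random.uniform(-init, init) for _ in range(num_features - 1)]   # step 7
         for ravg in row_avgs]
    C = [[sqrt(cavg / sigma1)] +                                   # step 4
         [random.uniform(-init, init) for _ in range(num_features - 1)]   # step 8
         for cavg in col_avgs]
    lrate_first = 1.0 / sigma1                                     # step 5
    return R, C, lrate_first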

I'm still looking into data inside matrices that may be imbalanced, e.g. having 17,000 ratings in a single row vs. sometimes 1-5 ratings in a column. The problem here is that a single row is updated 17,000 times vs. an update frequency of 5 for a column. Obviously, this might give rise to a very quick imbalance in the model, and trying out different learning rates, one for columns and one for rows, may be an answer. Some practical experiments there, however, have shown that this further destabilizes the model... so what can you do?? :).

Wednesday, November 11, 2009

Netflix prize revisited

The netflix prize has been over for some months. I ended up 215th in the high score table. Considering my involvement over time, it's probably where I should have ended up. Other groups have been at the prize for three years and rightfully ended up in the top 40.

If you still have a research interest in the Netflix prize, you can still download the data from the UCI Machine Learning Repository. This set also includes the actual ratings that were used to rate the submissions. It also includes the winning submission set.

A very interesting implementation for solving this problem is the PyFlix implementation. It requires the downloaded data, but converts it into a binary database that is later mmap-ed into the process space. That's quite clever, because with mmap you only page in the parts of an index once you actually need them. Certainly, you shouldn't do this within tight loops, but it goes to show that for performance computing, using files is generally a much better approach than databases. Of course you need to work on the files to get them to do something for you.

If you look at the Basic Usage section of the Trac page of PyFlix, you'll see some examples of the very nice interface that was built on top of the data. Researchers often don't have that much data to prove their concepts with, and an interface built on top of the Netflix data in this way is remarkably elegant for looking into similarities and trying out a couple of concepts directly from the Python interpreter.

I have found, however, that a production implementation of SVD or any other algorithm isn't truly viable in Python because of CPU constraints and overhead in a number of places. For comparison, these are the gcc flags for the binaries built on my machine:
-Wall -march=core2 -mtune=core2 -O2
These show that the binaries are tuned specifically to the processor I am running, instead of a generic "i386" architecture. I'm also no expert on Python, so there may be many ways to optimize Python so that it runs much better. The flags above generate code that is much more efficient and reduce a single cycle from 27 seconds on a generic i386 build to 7 seconds in total.

Although programmers generally don't worry that much about memory, since they don't need to (readability of code and other quality attributes need attention as well), for certain loops that are executed a very large number of times (in the millions) it becomes much more important to focus on the actual performance of that loop. This is a very nice article about memory, written by the maintainer of the glibc library of Linux. The glibc library is basically the glue between the kernel and your application, and it has a number of handy low-level utility functions that every application uses (like strlen, strstr, etc.).

One of the important aspects of maintaining performance is trying to sort data (where functionally and technically possible) such that the processor doesn't take a cache miss to acquire it. It will then be much quicker in those kinds of loops. Another kind of performance hog is the order in which data is traversed, for example:
int my2DArray[ ROWS ][ COLS ];
When cycling through this array, you'll want to do this by rows first, then cycle over the columns. The elements of a row are laid out linearly in memory, one column after the other. So you'd typically iterate through this array as follows:
for ( int i = 0; i < ROWS; i++ )
    for ( int j = 0; j < COLS; j++ )
        my2DArray[ i ][ j ];   /* touches consecutive memory addresses */
Compare that to:
for ( int j = 0; j < COLS; j++ )
    for ( int i = 0; i < ROWS; i++ )
        my2DArray[ i ][ j ];   /* jumps a whole row ahead at every step */
The second version takes cache misses all over the place, which is very bad for performance. Ideally, for computational tasks you find an algorithm that keeps data in the processor cache as long as possible, but of course only when that is feasible.

The implementation of pyflix is sneakily reading from disk and doing quite a bit of things in the background for every iteration of the rating loop. This is severely hurting performance in the long run. The good thing is that there's a very elegant API to access the data for other purposes and this API does include a rather fast index. It's as if a very tiny little specific database engine was written to access the data, which is a remarkable and impressive feat by itself!

Monday, November 09, 2009

SVD in Python

Here's a small example of Singular Value Decomposition using Python:
from scipy import linalg, mat, dot

matrix = mat( [[2,1,0,0], [4,3,0,0]] )
print "Original matrix:"
print matrix
U, s, V = linalg.svd( matrix )
print "U:"
print U
print "sigma:"
print s
print "VT:"
print V
dimensions = 1
rows, cols = matrix.shape
# Dimension reduction: zero out all singular values beyond the kept dimensions
for index in xrange(dimensions, rows):
    s[index] = 0
print "reduced sigma:"
print s
# Reconstruct the matrix from the reduced sigma
reconstructedMatrix = dot( dot( U, linalg.diagsvd(s, len(matrix), len(V)) ), V )
print "reconstructed:"
print reconstructedMatrix
This code prints the following:
Original matrix:
[[2 1 0 0]
[4 3 0 0]]
U:
[[-0.40455358 -0.9145143 ]
[-0.9145143 0.40455358]]
sigma:
[ 5.4649857 0.36596619]
VT:
[[-0.81741556 -0.57604844 0. 0. ]
[-0.57604844 0.81741556 0. 0. ]
[ 0. 0. 1. 0. ]
[ 0. 0. 0. 1. ]]
reduced sigma:
[ 5.4649857 0. ]
reconstructed:
[[ 1.80720735 1.27357371 0. 0. ]
[ 4.08528566 2.87897923 0. 0. ]]
And with one more dimension for sigma:
reduced sigma:
[ 5.4649857 0.36596619]
reconstructed:
[[ 2. 1. 0. 0.]
[ 4. 3. 0. 0.]]
This is how you can use Python for quick tests and experiments on SVD.

Wednesday, November 04, 2009

Abstract thought

Thought... what is it? I've posted before on the topic of consciousness and thought, without any final conclusion. The topic of thought is discussed in philosophy with differing opinions on the matter. Some say that thought has mystic properties and is only reproducible in biological matter; some in that camp go as far as to state that thought is purely human and is what separates us from animals. Could be, since there surely are specific differences in the way we learn, reason and behave, even compared to monkeys. See "Ape Genius" here for example. The last video talks about a very important difference, namely pointing. The question posed in the research is whether apes are more or less thinking like us, or merely share specific traits that make them behave like us. Looking at the videos, and specifically the parts about cooperation and learning, I have personally come to the conclusion that there is not that much in common (the problem is that apes look more like us than, say, horses do, so we're inclined to believe they think like us for that reason. But are they really the same once you completely ignore the 'look-alike'?). Back to the question at hand... there are other streams in philosophy that believe thought is computational, and then there are once again subdivisions in that region. Some say that the mind is partly computational, but has other traits that are incredibly hard to model and execute on a computer, for example.

Scientists now believe that they can recreate thought by replicating neural networks. So the idea is to think of a common task and then prove that this task can be satisfactorily executed by an artificial neural network running in a computer. The problem here is that the neural network is trained for one very particular task and no reasoning takes place other than the execution of that task. The neural network expects a certain range of inputs and will calculate the correct output based on those. If the inputs are out of range, the output is not guaranteed to be useful. Also, you will only get a meaningful output for that specific purpose, not an output that is meaningful in different scenarios.

The biggest problem here is that we can't think in sufficiently abstract terms, nor about the relations between those terms. Because we cannot imagine 'pure thought', what it looks like and how it could be alternatively represented, we keep pushing buttons in the hope of finding an interesting response somewhere that indicates some kind of causality between the external world and internal processing.

In order to simulate thought in a computer, one must assume that thought is purely computational, otherwise the motivation and the execution of the research is contradictory. Pure computational thought requires us to think differently about representations and find other ways to represent parts of the meta-model of the outside world. The world out there has a form and when we see a cat, we don't pick it up, put it in our head and reproduce it later in thought. So, thought requires us to model those things differently in our minds such that they can be reproduced later. Whether this be a word or a number is not truly relevant. The relevance is related to how the relations between these concepts can be maintained. So, reasoning about things isn't exactly about representing the concepts, but about representing the relations between concepts, what things do to one another or how they are typically related.

Singular Value Decomposition, often discussed on this blog in the context of collaborative filtering, has the ability to express patterns of co-occurrence, or relations between numbers or items. And here's the rub: for SVD to be useful, the designer / modeler needs to determine the correct way to put the numbers into the matrix before the calculation is started. The model dictates, for example, that users go into columns and movies go into rows. Then for each combination a number is inserted, yielding a huge matrix of interrelations between instances. The interesting thing is that one movie relates to many users and one user relates to many movies. So, in essence, the preference of a user is modeled within this matrix and related to the type and characteristics of a movie. In a sense, this means that preference is modeled against characteristics. We don't have any data available about movie characteristics or user preferences directly, but by generating this matrix we can suddenly start reasoning with them, although the exact meaning of the preferences and characteristics, appearing as numbers, may not be derivable.

And here it goes... in order to make those preferences and characteristics meaningful, one should have a classification system at hand that can compare classes with one another. Classification means comparing two things and trying to find the most important characteristics that make them match or differ. That operation is different from the calculation performed earlier.

So this goes back to our incapacity to think in truly abstract terms. We can get a feeling for something, but then if it is abstract, can't describe it. Although we are certain about incompatibility, incongruence or similarity for example. A computer model where these abstracts can be manipulated and translated into something meaningful, classified and everything backwards is going to be a very important step.

I think of the brain not as a huge number of neurons that are interconnected, but I think of each neuronal group as some kind of classifier or work item. In that sense, one can question whether it makes sense to simulate 100 billion neurons if the total effect of those biological computations can be simulated more effectively using stricter and cheaper mathematical operations, or a simulation of neuron groups in a neural group network instead, severely reducing the dimensions that are (thought) to be necessary.

This is a great question for research. Can a machine constructed from bits, which are 0's and 1's and therefore have no intermediate state, work with numbers and symbols in such a way that it starts to approximate fluid thought?

Sunday, November 01, 2009

Building a mapserver with a karmic koala

I've updated my Linux system to Karmic Koala over the weekend. It seems to work quite well. For the first time, I decided to kill all the binaries that somehow made it to my machine over a course of 2 years and do a fresh binaries install, keeping my home mount with data. That worked out well and the machine even booted my RAID-0 array with dmraid without any problems this time. Ubuntu 9.10 works like a treat, you should try it!


Getting down to business, if you want to find out how Google / TeleAtlas renders maps, here's a page that gives you an idea how the process works. A mapserver is basically an image server with a HUGE, HUGE, HUGE number of tiles behind it. Each zoomlevel maintains its own set of images basically, so that's why adding a zoomlevel to a finer-grained level will be costly space-wise. The tiles are constructed by adding GIS information of different types together from a rather large database. The tutorial that I found here is very easy to follow and the most comprehensive on the subject.

In this tutorial, they end up building a little world map, which I attempted and which worked out, as you can see. The world map on the right was constructed from the information in SHAPE files on my server. The overview is a very generic image, but the information in the SHAPE file is so detailed that I can zoom in to great extents and produce the complete coastline of Antarctica or of any country in Europe. Thanks go to the Openstreetmap project and the people and organizations that collaborate with it. Notice also how the world appears slightly distorted; I think this is related to the chosen projection method.

Well, once that is working, you can load your own spatial data in postgis and postgres and start drawing specific parts of your interest on detailed images. Instead of writing your own programs to do that, just use the utilities and scripts of the mapnik project. An example of that is here, central Amsterdam:

So this is great. I can now produce very detailed streetmaps of any region in Holland, reason about those places and ways through a database with spatial reasoning predicates, find out extents of regions, and so forth. Mapnik also provides scripts to generate tiles from a given 'planet' database. Tiling a whole country, however, can produce a very large amount of data on your PC, so use with care :). The above image was produced using the standard styling rules. It is possible to adjust this styling or replace it entirely, so that it becomes more personal. These generated images, together with a bit of JavaScript and the original PostGIS database as a backend for looking up points of interest, are at the core of Google Maps.
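For reference, rendering a single image from such a styling file with mapnik's Python bindings takes only a few lines; the file names below are placeholders:

import mapnik

m = mapnik.Map(1024, 768)
mapnik.load_map(m, "osm-style.xml")   # the XML style file describing layers and rules
m.zoom_all()                          # or zoom to a specific extent instead
mapnik.render_to_file(m, "amsterdam.png", "png")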

Another interesting application of GIS information is super-imposing data from different sources over the data in the database, or leaving specific sets of data out of the rendering so that the images carry less information, making it easier to focus on the important bits. You can see how Bjørn Sandvik made a thematic mapper for Google Earth by generating KML from thematic data merged with (simplified versions of) world boundaries. Although KML takes some time to render, especially in 3D (he wrote a nice, detailed paper about the techniques), you can generate 2D images by loading your thematic data into PostGIS first and then relating your data rows with the geographical data. Using a clever query and the pgsql2shp tool, it should be possible to output a file with the attributes you require for rendering. The last step is then to spit out an XML rendering file for mapnik, which basically filters your attributes and assigns colors or other styling measures, and to run it through the mapnik renderer.

There's lots of things one can do here. Be reminded that dealing with these tools can be a bit daunting at first. There's generally no need to write complicated mergers/processors, because you can use PostGis as an intermediate data store, which can output .shp files (the most portable format I reckon), which other tools can visualize or process further.

Tuesday, October 27, 2009

Drawing your country with openstreetmap

In a previous project, I worked with GIS data to show demographic data on Google Maps. The underlying database is PostgreSQL. This post shows how you can use (part of) the database of OpenStreetMap (verify the license!) to extract features from the dataset and paint them onto an SVG or PNG image. Using the PostGIS database in combination with the loaded data, you can extract features (these are like gmaps overlays, or GIS layers) and include them in the image. You want to include railroads and regular roads? Not a problem! You only want to include waterways and the general shape of the country? Can do! The image above was constructed by taking water, waterways, train railroads and forest areas of Holland. Then I zoomed in a bit to show the level of detail.

Since I mostly use Ubuntu, I'll explain the steps used to get things done.

Prerequisites:
  • Postgres 8.3 + Postgres 8.3 server-dev package
  • Postgis
  • cmake
  • qmake
  • libqt4-dev
  • bz2-dev library
  • geos library
  • gdal libraries (libgdal, gdal binaries *and* libgdal1-dev)
  • libxml2-dev header files and library
  • Benelux or other data: ( http://planet.openstreetmap.nl/ ) (no need to unpack! 170M packed, 1.8G unpacked )
Then, you probably need to download the SVN version of a utility called osm2pgsql. This utility allows you to load in the Benelux data into your postgis database. You can get it using:

svn co http://svn.openstreetmap.org/applications/utils/export/osm2pgsql/

Then, make and make install. This probably doesn't copy the default.style across from the svn checkout directory to where it is expected. So:

# mkdir /usr/share/osm2pgsql
# cp default.style /usr/share/osm2pgsql

Now, you're ready to load the Benelux data (assuming your PostGIS-enabled database is called gis):

osm2pgsql --slim -c -l -d gis planet-benelux-latest.osm.gz

Ok, so let's start visualizing this for a bit. You'll want to get qgis, compile that, install it, run it and then connect to your database:

# svn co https://svn.osgeo.org/qgis/branches/Release-1_1_0 qgis_1.1.0
# cd qgis_1.1.0
# mkdir build
# cd build
# ccmake ..
(verify output, resolve errors; press 'c' to configure, then 'g' to generate the build files)
# make && make install

Then you should be able to run qgis from the command line: qgis. This starts up the application. If you look for a couple of ESRI shape files, you can load them up and play around with them to see how things work. For the open street map data we have downloaded, we can connect to the database using "Add Postgis Layer". This allows you to select the host, database and tables to load in. It takes a while to get data out of the database, but eventually you get there and it shows all of the Benelux in a particularly bad colorful display :).

A better way to get your data out is to use multiple layers instead of one. There is a set of utilities that can be used to load ESRI shapes into the database, but also to get them out. Since we can use where clauses, disjunctions and conjunctions in SQL, it is simple to pick out what you're looking for, put it into a shape file and load that into qgis for visualization. Note that polygons are great for showing a bit of volume and color and that lines are more useful for borders (again assuming the database is named gis):

pgsql2shp -f forests gis "select osm_id, landuse, \"name\", way from planet_osm_polygon where landuse='forest'"
pgsql2shp -f water gis "select name, way from planet_osm_polygon where \"natural\"='water'"
pgsql2shp -f waterwegen gis "select name, way from planet_osm_polygon where not waterway is null"
pgsql2shp -f railways gis "select name, way from planet_osm_line where railway = 'rail'"

These exported shape files can now be imported into QGis. Then adjust the properties per layer, specifying pen for drawing a black line around polygon areas or not (I didn't in this case). The lines only use a pen and don't have fill colors. Match wood / forest areas with green and water with blue, then select particular types for roads and railways and you're set!

Using the Print Composer function in QGis, you can now export what you've been creating. Make sure to use the Add Map button first, then export to either SVG (Inkscape?) or PNG.

Here's a page that explains the features in the database, but note that the features are not consistently used throughout the database, so you should always check your results:

http://wiki.openstreetmap.org/wiki/Map_Features

More detailed zoom-in near Amsterdam area using the same restricted set of layers and a comparison with Google Maps:


Notice how Google Maps paints areas in a more generalized way, whereas the openstreetmap image still shows the original high-resolution ESRI geometry. It should not be very difficult to start painting images of GIS data in a style like the Google one above, then overlay the original high-resolution data over those images to indicate the position of cafés, cinemas, etc...

Good luck!

Thursday, October 22, 2009

Reasoning with(in) language

Natural language processing is (eventually) very much about understanding what is being written and (re-)recognizing words in the particular semantic contexts that apply. After playing around with the NLTK for a while, I have come to realize that the toolkit is geared much towards analysis of specific texts, or helps in defining an EBNF representation of a particular grammar so that a particular category of text can be parsed more successfully, or so that specific parsers / analyzers can be researched. I haven't seen any mention of a generic classifier (NP/VP/DET) that understands what words are for, more or less like an incremental learner that just goes along and finds out what goes where. Discovering the requirements for such a process is the real question. What makes up language? Why can language appear so fluid and in so many forms, and how come we recognize language so very quickly after an utterance, even though that particular utterance has likely never been spoken before? Or is there indeed a difference in the speed of interpretation of familiar versus non-familiar utterances? That is a very interesting research question. For now, I'm thinking about whether there are ways to discover the semantic information of words automatically, or possibly to let the computer express to us that something could not be recognized or didn't make sense, so that we could tell the computer how things actually work.

One thing I find is going to take a lot of time is explaining to computers how the world works. The advantage we have is our number of senses, which tells us a lot more about how observations combine (specifying the situation in greater detail) and about the possible consequences that could ensue (whether there's danger or everything is normal, etc). Our language also contains words that allow us to express particular information about the observations of our senses in great detail.

What is needed for computers to start learning is a generic model of the world, such that objects (instances of types or classes) can interact with other objects (and thereby classes), so that by observing a particular instance the computer can generalize it towards a class, or a class higher up the tree, continuing to find consistencies or inconsistencies in that existence. This yields a number of questions that the computer / agent must investigate for truth or falsehood. The result could be that the class that was initially constructed isn't entirely valid, or that the instance belongs to a totally different class that was not known until that time.

Most models I've seen revolve around recognizing a stream of information that refers to elements in the world, plus a pre-designed reasoning model that the computer uses to apply those observations appropriately. But that requires up-front design, which more or less rules out the computer learning on its own. What designers sometimes mean by "learning machines" is not learning new concepts; most of the time it is learning to recognize particular situations so that the corresponding actions (which are finitely defined in most computer programs) can be chosen.

Well, I have no idea what this model should look like, but there is no reason why it could not look like a generic model for multi-agent worlds. Such a model has relations, which describe the possible relations one entity/instance can have with another and in what way. Then there are functions, which can be thought of as actions or manipulators of instances, functions that do stuff to entities. Of course, these functions shouldn't directly issue a particular action, but only manipulate the object "in the mind's eye". So there's a clear distinction between observations made about objects in the real world and that real situation, versus the representation of those objects in working memory. Only when conclusions made in working memory make sense should the computer start a 'world action' by invoking some kind of control function.

The picture above was inspired by the act of 'pointing'. Looking at the animal kingdom, there are no animals that actually point at things to teach something to others (well, disregarding pointer dogs :). So pointing is very specific to the human learning experience and probably the thing that has allowed us to learn in the first place. Pointing is about attracting a person's attention towards something that is happening, because it is considered important for learning, or because other people should start taking some kind of action. So it's about teaching, but also about group behaviour and a simple form of communication. It is quite remarkable that no other animal has developed this particular trait, it being such a simple one.

The expectation for many programs is that they are complete, in the sense that basic functionality should work once things are bootstrapped or "learned". By that I mean that the capabilities of the program are pre-determined and programmed in; it only needs to find out when to execute which actions, which is generally determined by inspecting a pre-determined stream of information and learning after which sequence of patterns it needs to invoke which action, and so on.

It is quite interesting how we neglect the fact that we could also teach the program things by telling it. And even more importantly, how little research there has been so far on computer programs determining that their knowledge is insufficient to solve the case at hand, and on methods to complete that knowledge so the task can be executed in the future. That, to me, means we're still thinking of the computer program as a complete specification with prior requirements that executes a particular job.

Saturday, October 17, 2009

Natural Language ToolKit

Due to my interest in the z-machine, I'm looking at natural language parsing and the complexities it entails. There's actually a very nice Python-based project that allows one to study natural language processing; it's called NLTK. NLTK is a set of tools for getting frequency distributions and frequency plots, extracting information, processing raw text, etc. Basically, within a single line of code you can specify a lot of characteristics about the text or words you're interested in, then run a function over the words of another selected set and get the results you're looking for (at least, that's the idea). There are a lot of Python functions and objects prebuilt into the toolkit, offering many of the generic tools you'd typically use, to which you can feed specific sets of data for processing, or which you can parametrize with your specific intention. It's probably not immediately effective for a particular application, but this is research: first you need to find out what to do before you start off working on some solution that you think might work. It's all about getting really deep into the matter very quickly, experimenting with things and looking at results. I haven't worked extensively with Python, but neither Java nor C allows this enormously compact syntax for querying and manipulating data sets. Because the number of functions is pretty large, it may be a bit daunting to find out what objects or functions there are, or what they do. Therefore, I suggest installing Eclipse or another IDE and running pydev within that for code completion purposes.

Get the latest distribution of NLTK from here. You can install NLTK on Ubuntu as follows:
$ sudo -s
# apt-get install python-numpy python-matplotlib prover9
# unzip nltk-2.0b3.zip
# cd nltk-2.0b3/
# sudo python setup.py install
# python
Python 2.6...... (......)
>>> import nltk
>>> nltk.download()
The following shows a cumulative frequency plot of the words that occurred most often first:
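(The plot itself isn't reproduced here, but it is produced along these lines, following the NLTK book; text4, the Inaugural Address corpus, is just an example choice.)
>>> import nltk
>>> from nltk.book import text4
>>> fdist = nltk.FreqDist(text4)         # frequency distribution over the corpus tokens
>>> fdist.plot(50, cumulative=True)      # cumulative plot of the 50 most frequent words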
There are many other things that can be done. This is a textual example of collocations in the Wall Street Journal sample (text7). Collocations are words that frequently appear together.
>>> text7.collocations()
Building collocations list
million *U*; New York; billion *U*; Wall Street; program trading; Mrs.
Yeargin; vice president; Stock Exchange; Big Board; Georgia Gulf;
chief executive; Dow Jones; S&P 500; says *T*-1; York Stock; last
year; Sea Containers; South Korea; American Express; San Francisco
These examples are taken from the NLTK book. So, NLTK isn't really about making text accessible, or some kind of engine you can use for parsing / understanding text. It's a toolkit for experimenting with language processing, so that using that knowledge you can roll your own stuff afterwards. A specific design goal is low-threshold access to the tools and functions of the toolkit. This should allow people without programming experience to use machine processing tasks for their own research, and it hopefully puts NLTK well within reach of anyone looking into natural language processing, filtering spam, building search engines, building translation engines and so forth.

Wednesday, October 07, 2009

Rolling your own heap manager in Linux

I was looking at ways to maintain a lot of data in memory, which may get modified in any way through constant interaction, and then at methods to maintain that state across process invocations. One way to do this is to pick out structures and objects and store them elsewhere, but that involves a lot of work. Since the process I am working on has full knowledge of what happens inside it, I've considered the possibility of dumping an entire data heap to disk, so that it can be read in later and immediately reused.

One of the things that you can't do in that case is maintain file descriptors or other state-bound items that are tied to sockets and so on. So, the heap may only maintain data and even with state one should be very careful, since state is often bound to the currently executing process. There's ways around those things too however, so I decided to just go along and do this heap writing/reading.

The idea is to set up a 2GB file as the standard and maximum size of a heap and then map the contents of that file into memory using mmap. Let's call this the file heap. The mmap call extends the process address space directly when needed (as opposed to the classic heap, which grows via brk() and sbrk()). This 2GB looks contiguous to the process (although Linux may keep it in totally different pages in physical memory). The idea thereafter is that the process handling uses different memory pages, storing current state that is related to current handling, and that any data modifications are done using the file heap.

So, data -> file heap, any other necessary memory allocations are allocated from the standard glibc heap (using standard malloc() and free() ). This way, any data allocated with a specific s_malloc() or s_free() call will automatically be added to the internal data file in places that are available to it and programming will feel quite natural overall. It's just like dealing with normal memory.

When the program terminates through a standard quit call (or when it catches specific signals), msync() is called, synchronously flushing all memory changes to the disk file; the memory is then detached and the application exits. This should guarantee reasonable consistency between runs. Another run attaches to the file and all the data is there in the right place. For now, the application requires that this mmap is attached at the same base address, so that any pointers inside the file still point to something valid later. An alternative is the specific Linux structs and allocation functions that do this housekeeping for you, but that increases the size significantly.
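The allocator itself lives in C, but the persistence idea is easy to illustrate with Python's mmap module; this is a toy sketch with a 1 MB file, not the s_malloc()/s_free() machinery described above:

import mmap
import os

SIZE = 1 << 20                       # 1 MB toy "heap" instead of 2 GB
path = "data.heap"

# Create the backing file once, at its full size.
if not os.path.exists(path):
    with open(path, "wb") as f:
        f.truncate(SIZE)

f = open(path, "r+b")
heap = mmap.mmap(f.fileno(), SIZE)   # map the file into the address space

print "previous run left:", repr(heap[0:5])
heap[0:5] = "hello"                  # mutate the mapped region like ordinary memory
heap.flush()                         # msync(): push the changes back to the file
heap.close()
f.close()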

These methods and custom malloc() and free() implementations should allow the application to also do garbage collections, maintain reference counts and other clever stuff that I can't think of right now. The good thing is that this doesn't require the application to keep everything properly aligned, it deals with less complexity. The preallocated heap is then filled up until it's full. Theoretically, it's also a good idea to pre-slice the heap into three different areas and then work with three heaps instead. This means that all three heaps have good knowledge about the maximum size they should be having and they can arrange their memories with their own specialized algorithms.

Tuesday, October 06, 2009

Semantique

I've slowly started on a new experiment related to the other posts. I've read through documents and source code describing how the original adventure games were created. The z-machine is the most well-known specification. In its most basic form, an adventure game is a blob of data intermixed with code. The z-machine, the program that runs the adventure, is nothing more than a host to the adventure game (the story file); after loading the file, which is part data and part executable, it starts execution. In general, this causes the game to print the current location. All other things are initialized to their default values. I said that these games also had code, in a way. This code is actually pretty neat, because it uses quite simple constructs from the virtual machine execution engine located inside it.

So, the z-machine is a regular virtual machine that can execute opcodes and also find object properties and values and so forth. Because the game is more or less static, in the sense that the objects contained in the game never change, except to the observer, each object remains in the same position in memory. So, you could refer to the east door in room #3 as object 63 for example, and then use that identifier to reason with that object. Each object has a number of attributes that can be set, which allow you to specify if some door is open or closed. There are also user-definable properties per object.
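The following is not the actual z-machine object table layout, just a small C sketch of the idea: objects are referred to by number, attributes are bit flags, and properties are small user-definable key/value pairs.

/* Sketch of the idea behind the z-machine object table (not its real
 * on-disk layout): objects are plain numbers, attributes are bit
 * flags, and properties are user-defined key/value pairs. */
#include <stdint.h>

#define ATTR_OPEN    0
#define ATTR_LOCKED  1

struct property { uint8_t id; uint16_t value; };

struct object {
    uint32_t attributes;               /* one bit per attribute */
    uint16_t parent, sibling, child;   /* object tree, by object number */
    struct property props[8];          /* user-definable properties */
};

static struct object objects[256];     /* e.g. object 63 = "east door in room #3" */

static int has_attr( uint16_t obj, int attr )
{
    return ( objects[obj].attributes >> attr ) & 1;
}

static void set_attr( uint16_t obj, int attr, int on )
{
    if ( on ) objects[obj].attributes |=  (uint32_t)1 << attr;
    else      objects[obj].attributes &= ~((uint32_t)1 << attr);
}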

My interest is mostly in the fact that this model isn't widely used in programs. I intend to write a much more basic proof of concept, where I reserve 2G of memory and start loading concepts into this space. Then I can dump the contents of this space to disk and reload it later, where it will have the same state as before (in contrast to pickling or serialization processes on a per-object basis).

Using Ragel, it should not be too difficult to set up a parsing language which can accept standard sentences in close to natural language, such that it manipulates the state of that heap. Perhaps add a property dynamically to some class, or set an attribute of a certain object instance and so forth.
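As a rough idea of what that could look like, here is a tiny Ragel fragment (grammar and action names invented, buffering left out) that accepts a sentence of the form "spoon is a thing." and leaves hooks where the heap would be manipulated:

%%{
    machine sentence;

    # Collect the characters of the current word (buffering code omitted).
    action append { /* buffer[ buflen++ ] = fc; */ }
    action subj   { /* remember the buffered word as the subject */ }
    action kind   { /* look up or create the class, link the subject to it */ }

    word = [a-zA-Z]+ $append;

    # Accepts sentences like "spoon is a thing."
    main := word %subj space+ "is" space+ "a" space+ word %kind ".";
}%%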

The idea is that the program, which is not a virtual machine in the strictest sense because it does not yet have opcodes, can be used to insert class descriptions at runtime using language descriptions; then, using more sentences, you could make specific instances of those "things". Later I'd like to make it possible to load "boot programs" that declare things of a certain type, and later still the opcodes as a number of simple operations that can be executed, such that the program can start reasoning on a slightly higher level without having all of its capabilities pre-programmed and compiled in. This probably requires dynamic compilation, address substitution and so on, but that'll only be fun :).

At the front of this large bit of memory is a large dictionary of words that it understands, just like the z-machine. The idea is that these words refer to classes, instances or relations, describing their use and their meaning. Then, when a word is seen in a stream later, it should be possible to dynamically resolve what should happen and how the environment changes by looking up the knowledge links, enforcing the restrictions imposed by nouns (an object/class's capabilities) and the intention of the action (the verb). This probably requires coding a couple of rules, but those rules are declared earlier, dynamically compiled into another base and referred to as well.

As soon as the dynamic execution environment, programmed and thus driven through natural language, encounters an error, it will report the error in a human-friendly format (because it has access to property descriptions and other natural language material inserted earlier). This program won't be as powerful as a real programming language, but I'm only interested in its behaviour for reasoning. If you're interested, have a look at Appendix A of the standard rules zip file on this page, which gives an overview of how I imagine rules and code will be handled inside this application.

So in short, the z-machine and adventure games are used as inspiration for developing an environment in which (restricted) natural language is going to drive the logic and what is happening. It will still depend on the "program" (the sentences fed to it) what the environment eventually does (well, I think?), so theoretically it can also be used as a chatbot, or perhaps even a reasoning environment for scheduling problems....? Time will tell!

Sunday, October 04, 2009

On reasoning with ontologies in AI

The image here shows a very simple ontology of cities and countries. A larger graph like the one on the left can be used to make assertions about the location of cities and countries and the sizes of cities. I'll use the graph to show how artificial intelligence is largely about reasoning over and traversing graphs. One of the most referenced techniques is "state-space search". Now, that sounds as if someone found a way to make a computer really smart, but once you start working out the details, you realize that such graphs are predefined either statically or dynamically. The static graphs are loaded from some datastore somewhere; the dynamic graphs (probably the usual kind) are built on the fly. A big problem for graph searches in AI is knowing which branches to expand, because ideally you expand only those branches with solutions on them. Expanding any others is a waste of time and effort. Currently, AI uses heuristics to approach that problem.

Graphs may also be used in knowledge-based systems. The idea is that you insert a lot of knowledge in the form of statements (facts) and that you let the system derive new facts from what is given. So, for example, if you say that Chicago is a city, and that in cities there live people, then it follows that people live in Chicago, so any question like "do people live in Chicago?" can be answered. When you approach the boundaries of the scope of the knowledge, the performance degrades considerably.
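As a toy illustration of that Chicago example, facts can be stored as (subject, relation, object) triples and a single hard-coded rule can derive the new fact. The relation names and the rule are made up; a real knowledge base would generalize the matching:

/* Toy sketch of the "Chicago" example: facts are (subject, relation,
 * object) triples and one hard-coded rule derives new facts. */
#include <stdio.h>
#include <string.h>

struct fact { const char *s, *r, *o; };

static struct fact kb[64] = {
    { "Chicago", "is_a",  "city" },
    { "city",    "hosts", "people" },
};
static int nfacts = 2;

static int known( const char *s, const char *r, const char *o )
{
    for ( int i = 0; i < nfacts; i++ )
        if ( !strcmp( kb[i].s, s ) && !strcmp( kb[i].r, r ) && !strcmp( kb[i].o, o ) )
            return 1;
    return 0;
}

int main( void )
{
    /* Rule: X is_a C and C hosts Y  =>  X hosts Y. */
    for ( int i = 0; i < nfacts; i++ )
        for ( int j = 0; j < nfacts; j++ )
            if ( !strcmp( kb[i].r, "is_a" ) && !strcmp( kb[j].r, "hosts" )
                 && !strcmp( kb[i].o, kb[j].s )
                 && !known( kb[i].s, "hosts", kb[j].o ) )
                kb[nfacts++] = (struct fact){ kb[i].s, "hosts", kb[j].o };

    /* Answers "do people live in Chicago?" */
    printf( "%s\n", known( "Chicago", "hosts", "people" ) ? "yes" : "unknown" );
    return 0;
}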

So what exactly is reasoning? If we look back at the adventure game Zork, we can consider the flow of an adventure game as a finite state machine:


In adventure games, the players are reasoners and the pre-defined stories are basically FSMs, created by programmers or story writers. Another way of describing the objective of an adventure game is to find the sequence of commands that leads to a successful finish state. Some game commands don't change the game state, but instead inspect the value of some property or attribute (commands like look, listen, smell, etc.) and return a pre-configured string. Successful manipulative commands always change the game state, but not necessarily in a way that is useful for getting to the end of the game. For example, you could open and close a milk bottle in the fridge repeatedly without having any effect towards reaching the objective. For brevity, commands that change the world state to something non-useful are not visualized in the above graph.

Reasoning in the above graph is really easy, because you simply have to follow the graph from start->finish or vice versa and you have solved the problem. It's a single path here. End of complexity?! Actually, real problems are much more complicated than that, because they have different states that are modified at different moments. Putting all the information into a single graph would create one that is too complicated to handle. A state machine can and should handle only one specific memory location (the integer describing the current state) and then the state machine should be accompanied by rules on how that state may be changed. The following graph shows three different state machines, where some transitions carry particular restrictions on when the transition can be executed successfully and when it can't.

We've just made a distinction per state machine based on the observed value of the memory location. So, if FSM_1 governs the locations, then each location is described by a single number. If FSM_2 governs having the key or not, then 6 means not having the key and 7 means having the key.
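A small sketch of how those two machines could interact in code, with the location transitions of FSM_1 guarded by the state of FSM_2 (6 = no key, 7 = key). The room numbers and the table are invented for illustration:

/* Sketch of two interacting state machines: FSM_1 holds the location,
 * FSM_2 holds whether the player has the key (6 = no key, 7 = key).
 * A location transition may be guarded by the state of the key machine. */
#include <stdio.h>

struct transition {
    int from, to;          /* states of FSM_1 (locations) */
    int requires_fsm2;     /* required FSM_2 state, or -1 for "no constraint" */
};

static struct transition map[] = {
    { 1, 2, -1 },          /* room 1 -> room 2, always allowed   */
    { 2, 3,  7 },          /* room 2 -> room 3 only with the key */
};

static int can_move( int fsm1, int fsm2, int target )
{
    for ( unsigned i = 0; i < sizeof( map ) / sizeof( map[0] ); i++ )
        if ( map[i].from == fsm1 && map[i].to == target
             && ( map[i].requires_fsm2 == -1 || map[i].requires_fsm2 == fsm2 ) )
            return 1;
    return 0;
}

int main( void )
{
    printf( "2->3 without key: %d\n", can_move( 2, 6, 3 ) );  /* prints 0 */
    printf( "2->3 with key:    %d\n", can_move( 2, 7, 3 ) );  /* prints 1 */
    return 0;
}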

Reasoning has just become a bit harder in this case. Instead of just finding the reverse path, any reasoning program must start with a potential path that may be possible and then for each transition find out if there are constraints along that transition that might prohibit that change. Further complications arise when a given constraint is based on an event very early in the game, so following a path that eventually fails is very costly. This means that the heuristics for evaluating if the value may ever change in the desired direction becomes of great importance.

The above also shows how software is written. The software specification comes up with a set of rules, which are essentially dynamic generators for several state diagrams. The programmer adds his own variables to that and then glues the various state diagrams together through "if" statements, making the diagrams depend on one another. Add in a couple of state modifiers (which move one machine from state A -> B) and the party is complete.

Reasoning in this model means generating a deconstruction of the program or specification and inverting it. So a computer that can reason about programs should be able to look at the debug info of a computer program and make statements about how to achieve goal states. A bit like:
  1. For your program to return with code 2, the following paths are possible: "x,y,z,...".
  2. For path 'x' to be chosen, variable i must be equal to 3, j must contain "bird" and k must be less than 5.
  3. For path 'y' to be successful, .....
This shows that in step 2, the reasoning mechanism establishes sub-goals from which reasoning needs to continue. In this case, the sub-goals refer to some location in memory and a desirable value for that location. Let's assume that k should never be less than 5, because the knowledge base dictates that it is meaningless (in that case the program contains an error, since it compares the value to something meaningless). That sub-goal will then disappear from the board altogether, leaving only i or j as possible candidates. Reasoning becomes easier if the reasoning system has access to the inverted program, which I imagine to be a description of the program in reverse.

A genuinely new look at computer-based reasoning could research how conditional jumps are used and whether they are a useful basis for reasoning. Can they be used to model the complexities of actual reasoning itself? The idea is that without looking at the contents of fields and understanding the semantics, as long as we understand the structure of a state machine (the way states change), we may be able to reason with it. After all, a name to refer to something is just a pointer to some memory about the properties and classification of such a thing, just like a computer may have a pointer to a piece of memory that has other pointers to other capabilities.

But what if we only have a start and an end goal and a very large ontology? Then we don't have a program to analyze, only fragments of diagrams that are not connected to one another. The added complexity here is that we should use the rules within the ontology to find ways to connect ontologies and subdiagrams together such that it still makes sense.

Is this different from other logic languages like Prolog? Hmmm.. I don't know yet. A good test would be to see whether using such a model has different consequences. Most programming languages require tests to be written for certain things, it gets very repetitive to insert lots of facts about things, and an ontology depends a lot on proper classifications. It's probably related to a different way of looking at things, like a complete separation of terms (nouns and verbs) or something.

Saturday, October 03, 2009

Lessons from Zork

If you're in your 30s like me and got a Commodore, MSX or one of those other computers of that generation as a child, you might remember playing text-based adventure games. I was hooked on them at age 9 actually, and they taught me English (well, that and the dictionary :). Because of those games, I scored 9/10 in all English classes. We had never heard of the Internet back then, but there were already English-spoken movies with subtitles on TV, so that helped a bit as well. However, this post isn't meant to recollect those stories; it's about the design of the interpreters of those times. Just recently, I've become very interested in different designs and approaches to parsing, because it plays a central role in Natural Language Processing and, in that sense, in Artificially Intelligent programs that can deal with input in natural language. The Zork text adventures were the start of a range of games in the genre. One of their important and necessary features is to accept input from human beings giving orders, process it and then reply with the results of that command.

Zork actually had a very interesting design for handling this. The story itself was loaded into a z-machine, which executed it based on user input. If you want to experiment with such interpreters, have a look at frotz. There's actually a very low-key community in the world that still plays adventure games or, as they call it now, reads interactive fiction. They publish their work here and there. Here's an archive for reference.

Initially you may think that writing a work of interactive fiction or a text adventure involves a lot of coding work, but there's actually a very cool application that you can use to produce them, called inform. There is no real coding involved; everything is typed in natural language. Since a computer eventually needs to run your story, the only not-so-natural thing is that you declare properties and attributes about things that you personally take for granted. Because it is not obvious to a computer that a bed can support a person and that you can enter it, that needs to be declared explicitly in the story (the story eventually becomes the program). Using inform, a lot more people can write stories and adventures.

At this point, it's probably better to show an example:
The troll can be conscious or unconscious.
Check attacking the unconscious troll with something:
say "The unconscious troll cannot defend himself: He dies.";
say "Almost as soon as the troll breathes his last breath, a cloud
of sinister black fog envelops him, and when the fog lifts, the
carcass has disappeared.";
remove the troll from play instead.
So notice how the troll is given attributes and properties and how some sort of reasoning process is inserted just by applying natural language. The sequence of words (the phrase) used to manipulate the behaviour of the story (the program) is fixed. The descriptions, what you put between the double quotes, are just sequences of characters and remain unintelligible to the computer. The story is first compiled into a slightly lower-level language, which compilers targeting z-code can process:
[ TakeDrink;
canteen--;
if (canteen < 0) {
thirst=0; UpdateThirst();
} else {
thirst=4;
"Better watch for an oasis !";
}
];
This lower-level language is the language in which the older text adventures were coded. They used a lot of variables (often global) and contained very simple code that influenced game play. The limit was basically that of the programmer. It probably feels a bit like scripting.

In inform, by declaring an item as a kind (a category of what it is, e.g. "The spoon is a thing"), it instantly gets a couple of capabilities and the program knows what can and cannot be done to it, because the inform compiler combines it with a default rule set (which may be overridden in the text). The (intended) interaction of the player with the environment is always communicated through verbs and nouns. In the case of inform, the file already has a large number of default rules, actions and behaviours for all these verbs and possible interactions. Per verb, the program maintains a number of rules with regard to impossible actions, or perhaps temporarily impossible actions due to some state in the interpreter (maybe the player should eat spinach first before they can open the door?). For further reading, the architecture of inform is listed here.

The interpreter for these programs executes so-called opcodes, just like Java basically, so it is a kind of virtual machine. An interpreter is like an implementation of a virtual processor and memory on your computer that can do things at a slightly higher level of execution and protection (another way of saying that you're grouping CPU opcodes together in a useful way). In contrast to a function or method in a programming language, the interpreter makes no assumptions about the context in which it is used; it's only a dumb operational thing. A function or method in a general programming language exists within a very specific application context.
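For illustration, the dispatch loop of a toy interpreter could look like the following. The opcode set here is invented and has nothing to do with the z-machine's real opcodes; it only shows what "executing opcodes" means:

/* Tiny sketch of what "executing opcodes" means: a dumb dispatch loop
 * over bytecode, with no assumptions about the program it is running. */
#include <stdio.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static void run( const uint8_t *code )
{
    int stack[64], sp = 0, pc = 0;
    for ( ;; ) {
        switch ( code[pc++] ) {
        case OP_PUSH:  stack[sp++] = code[pc++];            break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp];      break;
        case OP_PRINT: printf( "%d\n", stack[--sp] );       break;
        case OP_HALT:  return;
        }
    }
}

int main( void )
{
    const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run( prog );   /* prints 5 */
    return 0;
}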

Because Zork and the z-machine were initially developed for computers with very limited resources (64K? 128K?), there is a lot of optimization done to cram the strings together (5 bits per character) and there are limits set on the memory addresses of the interpreter. Nowadays those limits can be relaxed considerably, and possibly this could allow some very nice programs to be created using these virtual interpreters. Read on here for more information on these z-machines and their relevance to AI.
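Roughly, the 5-bit packing works like this: every 16-bit word carries three 5-bit characters and the top bit of the last word marks the end of the string. The sketch below ignores shift characters, abbreviations and the other alphabets, so take it as an approximation of the format rather than a complete decoder:

/* Rough sketch of the z-machine's 5-bit string packing. Codes 6..31
 * map to 'a'..'z' in the default alphabet, code 0 is a space; shift
 * characters and abbreviations are simply skipped here. */
#include <stdio.h>
#include <stdint.h>

static void print_zstring( const uint16_t *words )
{
    for ( ;; ) {
        uint16_t w = *words++;
        int cs[3] = { ( w >> 10 ) & 0x1f, ( w >> 5 ) & 0x1f, w & 0x1f };
        for ( int i = 0; i < 3; i++ ) {
            if ( cs[i] == 0 )       putchar( ' ' );
            else if ( cs[i] >= 6 )  putchar( 'a' + cs[i] - 6 );
        }
        if ( w & 0x8000 )           /* top bit set: last word of the string */
            break;
    }
    putchar( '\n' );
}

int main( void )
{
    /* "zork" packed by hand: z=31, o=20, r=23, k=16, padding=5 */
    uint16_t zork[] = { (31 << 10) | (20 << 5) | 23,
                        0x8000 | (16 << 10) | (5 << 5) | 5 };
    print_zstring( zork );
    return 0;
}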

The text in the link above mentions Prolog. I've come to both respect and hate Prolog. Prolog is also an interpreter, but it works with a very general-purpose language and is very much related to programming in the general sense. Prolog always attempts to prove that your statements unify with something (become true) and does everything in its power to make it so. It is strongly related to first-order predicate logic and the research that surrounds it.

Now... you may know that when a program's behaviour needs to be extended or modified, you generally need to stop the program, change the sources, recompile, kill the running application and restart it with the modified version. In that sense, a program on a computer is static. It was never considered in the architecture of hardware and software that programs might need to change at runtime. That is because they were executing tasks that were entirely thought out before the implementation: the implementation takes place, you run it, and it can take care of business for a while.

An interpreter is an ideal environment for running experiments where the program can change at runtime and must be able to modify its behaviour and view of the world. If this interpreter can connect to programs online, natural language can also be used as a means to communicate ontologies to other programs or humans, much in the same way that the declarations above are used to extend the ontology of an adventure game (since that is essentially what such a declaration does). The difference is that the ontology is compiled beforehand and from that point onwards it is static.

The bad thing about AI is that it's feeling the pull of the early computer years and the way we think about computers, or actually what we consider processing to be. To most computer-savvy people, it's the most normal thing to kill applications, recode them, restart them and then observe what they do. I think we should probably regard computers slightly differently to make more progress in AI.

Possibly because of the argument above, ontologies for the web are also mostly expressed as static files. The idea is that knowledge is temporarily static and then doesn't change? Also, even though they look meaningful to us (although interspersed with nonsense brackets, dashes and other signs), the computer just sees xyzxyzxyz and absdfoip and it's all nonsense to it, except for the way in which it appears and can be recognized later. It's only by reasoning over those forms of appearance that it can pretend to process them semantically. The gotcha is that when we look at the files, they look meaningful, but that's because the words give us a short replay of the vision, hearing or feeling related to the terms.

The truth is that the computer has no knowledge at all about the meaning of the terms and just executes some other code when it sees xyzxyzxyz again, or when it knows that abcabcabc was in some way related to xyzxyzxyz. If you want to do yourself a favour and understand ontologies properly, recode an entire ontology in such nonsense terms and it becomes clear how easy it is to be misguided about how much computers really know.

Conclusion: The design of the z-machine interpreter is a very interesting concept for further AI research. One should not only consider interpreters to be relevant at runtime, but also integrate interpreters with compiler constructs, such that a phrase of input doesn't only modify the state of the interpreter, but may also modify the program itself (adding knowledge to the knowledge base). This allows one machine to talk to another using natural language (easier to debug) and it requires only one interface implementation, since humans use the same channel of input. The interpreter should also be required to sync its state to disk and start up later with that memory load, such that it can continue executing from where it left off the last time. An adventure game interpreter is coded with a goal in mind, namely executing until it somehow reaches an end state (you could visualize the interpreter as a Markov chain or even a finite state machine), where knowledge is fixed and the transition from one knowledge element to another, or one contextual state to another, is determined by rules.

Ongoing question: Now, this gives us one interesting question to ponder over next: for an environment in which knowledge may be modified and received and states relating to that knowledge manipulated (the context), who or what will set the goals for this interpreter to achieve and what do those goals look like? Can the interpreter determine its own goals and negotiate with us or other computers to try to achieve them?

Thursday, October 01, 2009

CSV file parsing with ragel

I wanted to get my feet wet a bit more with Ragel and get acquainted with the way it works. Some good examples demonstrating the syntax are here. It's definitely a very impressive piece of software. Setting up a toy language is a bit too much work for just toying around, so I decided to find out how to parse a CSV file instead. I started out with the sample that parses parameters from the command line and adjusted it to read a 3-column CSV file. It's probably not the most efficient code and it leaves things to be improved:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BUF_LEN 1023

struct csvline {
char *f1;
char *f2;
int f3;
};

struct csvdata {
int cs;
int buflen;
char buffer[ MAX_BUF_LEN + 1 ];
int field;
struct csvline line;
};

void print_data( struct csvdata *data, int lineno );

%%{
machine csv;
access data->;

# Append to the buffer.
action append {
if ( data->buflen < MAX_BUF_LEN )
data->buffer[ data->buflen++ ] = fc;
}

# Terminate a buffer.
action term {
if ( data->buflen < MAX_BUF_LEN )
data->buffer[ data->buflen++ ] = 0;

switch( data->field ) {
case 0:
data->line.f1 = (char *)calloc( data->buflen, sizeof( char ) );
strncpy( data->line.f1, data->buffer, data->buflen );
data->field++;
break;
case 1:
data->line.f2 = (char *)calloc( data->buflen, sizeof( char ) );
strncpy( data->line.f2, data->buffer, data->buflen );
data->field++;
break;
case 2:
data->line.f3 = atoi( data->buffer );
data->field++;
break;
default:
// ignore
break;
}
}

# Clear out the buffer
action clear { data->buflen = 0; }

# Helpers that collect strings
LF = "\n";
string = [^,]* >clear $append;
string2 = [^,]* >clear $append %term;
comma = "," %term;
main := ( string comma )+ string2 ? LF;
}%%

%% write data;

void csv_init( struct csvdata *data ) {
data->buflen = 0;
%% write init;
}

void csv_exec( struct csvdata *data, const char *d, int len )
{
const char *p = d;
const char *pe = d + len;

%% write exec;
}

int csv_finish( struct csvdata *data )
{
if ( data->cs == csv_error )
return -1;
if ( data->cs >= csv_first_final )
return 1;
return 0;
}

#define BUFSIZE 2048

int main( int argc, char **argv )
{
struct csvdata csvdata;
FILE *csvfile;
int lineno = 0;
char buf[ MAX_BUF_LEN + 1 ] = {"\0"};

if (( csvfile = fopen( "test.csv", "r" ) ) == NULL ) {
fprintf( stderr, "Could not open file test.csv\n" );
return -1;
}

while ( fgets( buf, MAX_BUF_LEN, csvfile ) != NULL ) {
// One more line to process
memset( &csvdata, 0x00, sizeof( csvdata ));
csv_init( &csvdata );   // runs the generated "write init" so the machine starts in its start state
csv_exec( &csvdata, buf, strlen( buf ) );
if ( csv_finish( &csvdata ) != 1 ) {
fprintf( stderr, "error occurred in line: %d\n", lineno );
} else {
print_data( &csvdata, lineno );
}
free( csvdata.line.f1 );   // free the fields calloc'ed in the term action
free( csvdata.line.f2 );
lineno++;
}

return 0;
}

void print_data( struct csvdata *data, int lineno ) {
fprintf( stdout, "[line %d] f1: %s", lineno, data->line.f1 );
fprintf( stdout, ", f2: %s", data->line.f2 );
fprintf( stdout, ", f3: %d\n", data->line.f3 );
}


The following test.csv file was used:


test1,test2,4000
more,data,5032335
and,even,111
more,data,1213
errorhere
invalid line


And this is how to compile and visualize:


compile.sh:
-----------

#!/bin/bash

ragel main.rl
ragel -V main.rl > test.dot
gcc -o main main.c

visualize.sh:
-------------

#!/bin/bash

dot -Tpng -otest.png test.dot
eog test.png

Tuesday, September 29, 2009

FSM and ragel

The diagram on the right is a depiction of a state machine that parses command line arguments. I've been looking at ragel lately, because the architecture and design are genuinely compelling. The philosophy and architecture behind it are not necessarily limited to lexing input or protocols (although that is basically what ragel does). I'm looking at this from the perspective of applied research into knowledge-base sharing and upgrading between intelligent agents. One of the ideas I was having is whether it's possible to develop a common knowledge between two computer processes that is not necessarily static (as in 'pre-defined'), but may actually have dynamic properties, such that it can reason with its internal state and knowledge base to resolve specific dead ends and so on.

(btw, just in between, for an explanation of how I post code on blogger without using a syntax highlighter: http://kevin-berridge.blogspot.com/2007/08/posting-code-on-blogger.html).

The above diagram was generated by specifying a sort of language for the command line arguments that the application understands. Language is to be interpreted in the broadest sense of the word. Think of it as any stream of input characters in which you can convey ideas or specifications of actions to undertake.

In the above diagram, a state is reached when the state machine can successfully pick up the next character from the stream. So, the state machine can move to a different state if it finds that the next character in the stream is that specific symbol. It's a bit like a filter. Some states have multiple exit points (they can go over a number of transitions), which is fine. The interesting characteristic of ragel in comparison with a lexer is that you're string matching and executing code at the same time. So when using ragel, you may start executing things which at a later point turn out not to be completable, because the final part of the input is missing. It takes a bit of programming to either discard that state or use it anyway; it's not applicable in every context. I can imagine that if you work on transaction-based systems, you just panicked at these statements :). There, you typically wait until the full request is in, generate a response and wait for the client to actually tell you to commit before doing it for real.

Another interesting aspect of ragel is that it doesn't lean on glibc or other heavier functions (that much). In the above example, you'd typically use some strXXX function from glibc to find out what the user supplied. You also need to make sure your buffers are correctly set up and that you don't run over them (I always use the strnXXX functions just in case I get caught out). ragel, on the other hand, works on your supplied buffers directly and uses pointer arithmetic. There are two output modes: table-based, where the transitions from one state to another are encoded as data tables, and goto-based.

Gotos should probably be considered evil, but with state machines I'm starting to think that machine intelligence could greatly benefit from execution contexts that can switch very quickly from one state to another. Earlier posts from 2007 already rambled on about stackless python and so on. Having a stack that grows indefinitely doesn't help much.

Now, in the philosophy of ragel, would there be a possibility to develop an agent language that runs in some kind of engine where the agent would continuously instruct the engine what to execute next? Maybe a thread-based context of instructions could help to make this multi-processing.

In that line of thought, consider that a state is basically an identifiable place, state of mind or state in a computer. Having an apple in your hand could be considered a state. A large problem in AI is how to make computers reason from one state to another. Generally this is done with a pre-defined knowledge base that defines all the rules before anything is executed. Such machines or robots become applicable to one thing only (whatever is in the knowledge base) and not much else.

Now, a start state is known, maybe it's idle or maybe it's about being hungry or some pro-active state where the AI is trying to achieve something (possibly governed by some emotion engine?). The interaction of several state machines together would be really interesting here. The idea is to get from "start" to "goal" state. If the computer would simulate in its engine how it could get from start to goal by going over the transitions, then it may be able to find different ways of achieving its objective. If transitions have costs associated with them, then the AI could reason about the best method to achieve the objective.
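If transitions carry costs, picking the best method is just a cheapest-path search over the transition table. A plain Dijkstra-style sketch, with invented states and costs:

/* Sketch of "reasoning about the best method": a plain Dijkstra search
 * over a small transition table where each transition has a cost. The
 * states and costs are invented purely for illustration. */
#include <stdio.h>
#include <limits.h>

#define N 5                         /* number of states */

static int cost[N][N];              /* 0 = no transition */

int cheapest( int start, int goal )
{
    int dist[N], done[N] = { 0 };
    for ( int i = 0; i < N; i++ ) dist[i] = INT_MAX;
    dist[start] = 0;

    for ( int iter = 0; iter < N; iter++ ) {
        int u = -1;
        for ( int i = 0; i < N; i++ )          /* pick nearest unvisited state */
            if ( !done[i] && dist[i] != INT_MAX && ( u == -1 || dist[i] < dist[u] ) )
                u = i;
        if ( u == -1 ) break;
        done[u] = 1;
        for ( int v = 0; v < N; v++ )          /* relax outgoing transitions */
            if ( cost[u][v] && dist[u] + cost[u][v] < dist[v] )
                dist[v] = dist[u] + cost[u][v];
    }
    return dist[goal];              /* INT_MAX means the goal is unreachable */
}

int main( void )
{
    cost[0][1] = 4;  cost[1][2] = 4;           /* the long way: 0 -> 1 -> 2  */
    cost[0][3] = 1;  cost[3][2] = 2;           /* the cheap way: 0 -> 3 -> 2 */
    printf( "cheapest 0 -> 2: %d\n", cheapest( 0, 2 ) );   /* prints 3 */
    return 0;
}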

Taking a transition also means using internal resources; it isn't necessarily a trivial task. A robot could be in a start state somewhere identifiable in the current space and decide that it needs to move to another location, thus another state. The current focus then is how to get from start -> location. The transition to do that is movement, and movement is concerned with finding a path from start -> location, possibly using intermediate states to achieve it. If the transition finds through observation that everything is blocked, it may decide to panic or to attempt other (more costly?) ways.

What is described here is a design for a flexible reasoning engine that depends fully on (a combination of) state machines, which execute snippets of code in between their reasoning processes. Combine this with a shareable language between other robots and human beings (an interactive terminal?) and the computer could start asking questions...:
  1. Q: "how start->goal?"
  2. A: "apply movement transition"
  3. ( robot downloads movement knowledgebase and code? )
A basic scenario involves a monkey, a box, a cage with three prescribed locations and a banana. The objective for the monkey is to grab the banana, which it can only do if the box is in the middle of the room, the monkey is standing on the box and it reaches out to grab the banana. This is a reasoning problem, as the details and specifics of actually executing those actions are of a different domain. Actually, the interesting part would be to communicate to other modules of an AI that something is an objective and leave it to sensors and other components to actually carry out the specific task. When those modules all agree that the task has been executed, they can communicate this back to the reasoning module, which is then confirmed to be in the new state.

Ragel doesn't just apply actions when it's doing a transition. It can also do this when leaving states and entering them. This allows for some more flavours of interestingness. The idea is that an AI should be able to dynamically extend its knowledge base (which a couple of implementations do), ideally through communication using a simple, non-ambiguous language to communicate those knowledge gems.

In the example of the monkey above, a goal state could be to "have the banana". The computer then doesn't know how to get into that state, so it needs to understand the differences in the following:
  1. how to grab a banana
  2. how to reach out for a banana
  3. how to climb on a box (and whether it is strong enough to support the monkey)
  4. whether the monkey robot is tall enough to reach the banana without the box
  5. how to move the box around the room (and that the monkey cannot be on the box to do this).
  6. whether the box is in the middle of the room
Using these states, you can draw a state diagram of actions to be executed in a certain order. Eventually, if you leave the reasoning to the computer, it should reach a sequence of actions that is least costly to execute (the shortest way to get there) and that is what it should try.
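To make the monkey example a bit more tangible, the world state and one precondition check could be encoded like this (the state layout is my own; a planner would search over the available actions until has_banana becomes true):

/* Sketch of the monkey-and-banana state and one precondition check.
 * The encoding (three locations, two flags) is invented for illustration. */
#include <stdio.h>

enum location { DOOR, MIDDLE, WINDOW };

struct world {
    enum location monkey, box;
    int on_box;        /* 1 if the monkey is standing on the box */
    int has_banana;
};

/* "grab" only succeeds when all preconditions hold. */
static int grab_banana( struct world *w )
{
    if ( w->box == MIDDLE && w->monkey == MIDDLE && w->on_box ) {
        w->has_banana = 1;
        return 1;
    }
    return 0;
}

int main( void )
{
    struct world w = { DOOR, WINDOW, 0, 0 };
    printf( "grab at start: %d\n", grab_banana( &w ) );    /* 0: preconditions fail */

    /* After "move to box", "push box to middle", "climb on box"... */
    w.monkey = MIDDLE; w.box = MIDDLE; w.on_box = 1;
    printf( "grab after plan: %d\n", grab_banana( &w ) );  /* 1 */
    return 0;
}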