Wednesday, November 11, 2009

Netflix prize revisited

The netflix prize has been over for some months. I ended up 215th in the high score table. Considering my involvement over time, it's probably where I should have ended up. Other groups have been at the prize for three years and rightfully ended up in the top 40.

If you still have a research interest in the Netflix prize, you can still download the data from the UCI Machine Learning Repository. This set also includes the actual ratings that were used to rate the submissions. It also includes the winning submission set.

A very interesting implementation for solving this problem is the PyFlix implementation. This implementation requires the downloaded data, but converts it into a binary database that is mmap-ed later into the process space. That's quite clever, because using mmap you can also dump some index or mmap some index once you need it. Certainly, you shouldn't do this within loops, but it goes to show that for performance computing, using files is generally a much better approach than databases. Of course you need to work on the files to get them to do something for you.

If you look at the Basic Usage of the Trac page of PyFlix, you'll see some examples of the very nice interface that was built on top of the data. Now, researchers often have not so much data to proof their concepts and the interface built on top of the netflix data in such a way is remarkably elegant for looking into similarities and trying out a couple of concepts from the python interpreter directly.

I have found however that a production implementation of svd or any other algorithm isn't truly viable in Python because of the CPU constraints and the overhead in a number of things. One of those things being the following flags (gcc) for the binaries built on my machine:
-Wall -march=core2 -mtune=core2 -O2
These show that the binaries to be built are tuned specifically to my processor that I am running, instead of generic "i386" architectures. I'm also no expert on Python, so there may be many ways to totally optimize python such that it runs much better. The flags above will generate code that is much more efficient and reduces a single cycle of 27 seconds on generic i386 architectures to 7 secs in total.

Although programmers don't worry about memory in general that much, since they don't need to (readibility of code and other quality attributes need to be paid attention to as well), for certain loops that are executed a very large number of times (in the millions), it becomes much more important to focus on the actual performance of that loop. This is a very nice article about memory, written by the maintainer of the glibc library of Linux. The glibc library is the glue between the kernel and your application basically and it has a number of handy low-level utility functions that every application uses (like strlen, strstr, etc.).

One of the important aspects of maintaining performance is trying to sort data (if possible functionally and technically), such that the processor doesn't get a cache miss to acquire the data. It will then be much quicker in those kinds of loops. Another kind of performance hog is the ordering of data, for example:
my2DArray[ ROWS ][ COLS ];
When cycling through this loop, you'll want to do this by rows first, then cycle on the columns. The columns are probably linearly aligned in memory, one after the other. So you'd typically iterate through this array through:
for i = 0 to ROWS:
for j = 0 to COLS:
my2DArray[ i ][ j ]
Compare that to:
for j = 0 to COLS:
for i = 0 to ROWS:
my2DArray[ i ][ j ]
The second has cache misses all over the place, which is very bad for performance. Ideally, for computational tasks you find an algorithm that both keeps data close in the processor cache as long as possible, but of course only if that is feasible.

The implementation of pyflix is sneakily reading from disk and doing quite a bit of things in the background for every iteration of the rating loop. This is severely hurting performance in the long run. The good thing is that there's a very elegant API to access the data for other purposes and this API does include a rather fast index. It's as if a very tiny little specific database engine was written to access the data, which is a remarkable and impressive feat by itself!

Monday, November 09, 2009

SVD in Python

Here's a small example of Singular Value Decomposition using Python:
from scipy import linalg, mat, dot;
matrix = mat( [[2,1,0,0], [4,3,0,0]] );
print "Original matrix:"
print matrix
U, s, V = linalg.svd( matrix )
print "U:"
print U
print "sigma:"
print s
print "VT:"
print V
dimensions = 1
rows,cols = matrix.shape
#Dimension reduction, build SIGMA'
for index in xrange(dimensions, rows):
s[index]=0
print "reduced sigma:"
print s
#Reconstruct MATRIX'
reconstructedMatrix= dot(dot(U,linalg.diagsvd(s,len(matrix),len(V))),V)
#Print transform
print "reconstructed:"
print reconstructedMatrix
This code prints the following:
Original matrix:
[[2 1 0 0]
[4 3 0 0]]
U:
[[-0.40455358 -0.9145143 ]
[-0.9145143 0.40455358]]
sigma:
[ 5.4649857 0.36596619]
VT:
[[-0.81741556 -0.57604844 0. 0. ]
[-0.57604844 0.81741556 0. 0. ]
[ 0. 0. 1. 0. ]
[ 0. 0. 0. 1. ]]
reduced sigma:
[ 5.4649857 0. ]
reconstructed:
[[ 1.80720735 1.27357371 0. 0. ]
[ 4.08528566 2.87897923 0. 0. ]]
And with one more dimension for sigma:
reduced sigma:
[ 5.4649857 0.36596619]
reconstructed:
[[ 2. 1. 0. 0.]
[ 4. 3. 0. 0.]]
This is how you can use Python for quick tests and experiments on SVD.

Wednesday, November 04, 2009

Abstract thought

Thought... what is it? I've posted before on the topic of consciousness and thought. Without any final conclusion, the topic of thought is discussed in philosophy with differing opinions on the matter. Some say that thought has mystic properties and is only reproducible in biological matters, some in that camp go as far to state that thought is purely human and is what separates us from animals. Could be, since there surely are specific differences in the way we learn, reason and behave, even compared to monkeys. See "Ape Genius" here for example. The last video talks about a very important difference, namely pointing. The questions posed in the research is whether apes are more or less thinking like us or share specific traits that make them behave like us. Looking at the videos and specifically the parts about cooperation and learning, I have personally come to the conclusion that there is not that much in common (the problem is that in a way, apes look more like us than, say, horses, so we're inclined to believe they think like us for that reason. But are they really the same once you completely ignore the 'look-alike'? ). Back to the question at hand... there are other streams in philosophy that believe thought is computational. Then there are once again subdivisions in that region. Some say that the mind is partly computational, but has other traits that are incredibly hard to model and execute on a computer for example.

Scientists now believe that they can recreate thought by replicating neural networks. So the idea is to think of a common task and then proof that this task can be satisfiably executed by an artificial neural network running in a computer. The problem here is that the neural network is trained for a very particular task and there is no reasoning taking place other than the execution of that particular task. So the neural network expects a range of inputs and will calculate the correct output based on those. If the inputs are out of range, the output is not guaranteed to be useful. Also, you will only get a meaningful output for a specific purpose, not an output that is meaningful in different scenarios.

The biggest problem here is that we can't think in sufficiently abstract terms and the relations between those terms. Because we cannot imagine 'pure thought', what it looks like and how it can be alternatively represented, we keep pushing buttons in the hope that somewhere we find an interesting response somewhere that indicates some kind of causality of the external world and internal processing.

In order to simulate thought in a computer, one must assume that thought is purely computational, otherwise the motivation and the execution of the research is contradictory. Pure computational thought requires us to think differently about representations and find other ways to represent parts of the meta-model of the outside world. The world out there has a form and when we see a cat, we don't pick it up, put it in our head and reproduce it later in thought. So, thought requires us to model those things differently in our minds such that they can be reproduced later. Whether this be a word or a number is not truly relevant. The relevance is related to how the relations between these concepts can be maintained. So, reasoning about things isn't exactly about representing the concepts, but about representing the relations between concepts, what things do to one another or how they are typically related.

Singular Value Decomposition, often discussed on this blog within the context of collaborative filtering, has the ability to express patterns of co-occurrence or relations between numbers or items. And here starts the rub. For SVD to be useful, the designer / modeler needs to determine the correct way to put the numbers into the matrix before the calculation is started. The model dictates for example that users go into columns and movies go into rows. Then for each combination, a number is inserted, yielding a huge matrix of interrelations between instances. The interesting thing here is then that one movie relates to many users and one user relates to many movies. So, in essence, the preference of a user is modeled within this matrix and related to the type and characteristics of a movie. In a sense, this means that preference is modeled against characteristics. We don't have any data available about movie characteristics or user preferences directly, but generating this matrix we can suddenly start reasoning with those, although the exact meaning of preferences and meaning of characteristics, appearing as numbers, may not be derive-able.

And here goes... in order to make those preferences and characteristics meaningful, one should have a classification system at hand that can compare classes with one another. Classification means comparing two things to one another and trying to find the most important characteristic that make them match or differ. That operation is different from the calculation performed earlier.

So this goes back to our incapacity to think in truly abstract terms. We can get a feeling for something, but then if it is abstract, can't describe it. Although we are certain about incompatibility, incongruence or similarity for example. A computer model where these abstracts can be manipulated and translated into something meaningful, classified and everything backwards is going to be a very important step.

I think of the brain not as a huge number of neurons that are interconnected, but I think of each neuronal group as some kind of classifier or work item. In that sense, one can question whether it makes sense to simulate 100 billion neurons if the total effect of those biological computations can be simulated more effectively using stricter and cheaper mathematical operations, or a simulation of neuron groups in a neural group network instead, severely reducing the dimensions that are (thought) to be necessary.

This is a great question for research. Can a machine that is constructed from bits, which are 0's and 1's, therefore have no intermediate state, work with numbers and symbols in such a way that it starts to approximate fluid thought?

Sunday, November 01, 2009

Building a mapserver with a karmic koala

I've updated my Linux system to Karmic Koala over the weekend. It seems to work quite well. For the first time, I decided to kill all the binaries that somehow made it to my machine over a course of 2 years and do a fresh binaries install, keeping my home mount with data. That worked out well and the machine even booted my RAID-0 array with dmraid without any problems this time. Ubuntu 9.10 works like a treat, you should try it!


Getting down to business, if you want to find out how Google / TeleAtlas renders maps, here's a page that gives you an idea how the process works. A mapserver is basically an image server with a HUGE, HUGE, HUGE number of tiles behind it. Each zoomlevel maintains its own set of images basically, so that's why adding a zoomlevel to a finer-grained level will be costly space-wise. The tiles are constructed by adding GIS information of different types together from a rather large database. The tutorial that I found here is very easy to follow and the most comprehensive on the subject.

In this tutorial, they end up building a little world map, which I attempted and worked out, as you can see. The world map on the right was constructed by the information of SHAPE files on my server. The overview is a very generic image, but the information in the SHAPE file is so great, that I can zoom in to great extents and produce the complete coastline of Antarctica or any country in Europe. Thanks to the efforts of the Openstreetmaps project that is and people and organizations that collaborate with them. Notice also how the world seems to appear slightly distorted. I think this is related to the chosen projection method.

Well, once that is working, you can load your own spatial data in postgis and postgres and start drawing specific parts of your interest on detailed images. Instead of writing your own programs to do that, just use the utilities and scripts of the mapnik project. An example of that is here, central Amsterdam:

So this is great. I can now produce very detailed streetmaps of any region in Holland, reason about those places and ways through a database with spatial reasoning predicates, find out extents of regions, and so forth. Mapnik also provides scripts to generate tiles from a given 'planet' database. Tiling a country however can produce a very large amount of data on your PC, so use with care :). The above image was produced using standard styling rules. It is possible to adjust this styling or replace it entirely, such that it becomes more personal. These generated images, together with a bit of JavaScript and the original PostGis database as a backup to the locations of points of interest are at the core of Google Maps.

Well, another interesting application of GIS information is by super-imposing data from different sources over the data in the database, or not rendering specific sets of data from the source in the database so that they have less information, making it easier to focus on the important bits. You can see how Bjørn Sandvik made a thematic mapper for Google Earth by generating KML from thematic data merged with (simplified versions of) world boundaries. Although KML takes some time to render, especially when in 3D (he wrote a nice, detailed paper about the techniques though), you can generate 2D images by loading your thematic data in PostGis first, then relating your data rows with the geographical data. Using a clever query and the pgsql2shp tool, it should be possible to output a file with the attributes you require for rendering. The last step is then to spit out an XML rendering file for mapnik, which basically filters your attributes, assigns colors or other styling measures and then run it through the mapnik renderer.

There's lots of things one can do here. Be reminded that dealing with these tools can be a bit daunting at first. There's generally no need to write complicated mergers/processors, because you can use PostGis as an intermediate data store, which can output .shp files (the most portable format I reckon), which other tools can visualize or process further.

Tuesday, October 27, 2009

Drawing your country with openstreetmap

In some previous project, I worked with GIS data to show demographic data on Google Maps. The underlying database is PostGreSQL. This post shows how you can use (part of) the database of OpenStreetMap (verify license!) to extract features from the dataset to paint this onto an SVG or PNG image. Using the PostGIS database in combination with the loaded data, you can extract features (these are like gmaps overlays, or GIS layers) and include them in the image. You want to include railroads and regular roads? Not a problem! You only want to include waterways and the general shape of the country? Can do! The image above was constructed by taking water, waterways, train railroads and forest areas of Holland. Then I zoomed in a bit to show the level of detail.

Since I mostly use Ubuntu, I'll explain the steps used to get things done.

Prerequisites:
  • Postgres 8.3 + Postgres 8.3 server-dev package
  • Postgis
  • cmake
  • qmake
  • libqt4-dev
  • bz2-dev library
  • geos library
  • gdal libraries (libgdal, gdal binaries *and* libgdal1-dev)
  • libxml2-dev header files and library
  • Benelux or other data: ( http://planet.openstreetmap.nl/ ) (no need to unpack! 170M packed, 1.8G unpacked )
Then, you probably need to download the SVN version of a utility called osm2pgsql. This utility allows you to load in the Benelux data into your postgis database. You can get it using:

svn co http://svn.openstreetmap.org/applications/utils/export/osm2pgsql/

Then, make and make install. This probably doesn't copy the default.style across from the svn checkout directory to where it is expected. So:

# mkdir /usr/share/osm2pgsql
# cp default.style /usr/share/osm2pgsql

Now, you're ready to load the Benelux data:

osm2pgsql --slim -c -l -d planet-benelux-latest.osm.gz

Ok, so let's start visualizing this for a bit. You'll want to get qgis, compile that, install it, run it and then connect to your database:

# svn co https://svn.osgeo.org/qgis/branches/Release-1_1_0 qgis_1.1.0
# cd qgis-1.1.0
# mkdir build
# cd build
# ccmake ..
(verify output, resolve errors). Press 'g' to generate scripts

Then you should be able to run qgis from the command line: qgis. This starts up the application. If you look for a couple of ESRI shape files, you can load them up and play around with them to see how things work. For the open street map data we have downloaded, we can connect to the database using "Add Postgis Layer". This allows you to select the host, database and tables to load in. It takes a while to get data out of the database, but eventually you get there and it shows all of the Benelux in a particularly bad colorful display :).

A better way to get your data out is by using multiple layers, instead of one. There are a set of utilities that can be useful to load ESRI shapes into the database, but also to get them out. Since we can use where statements, disjunctions and conjunctions in SQL, it is simple to pick out what you're looking for, put this into a shape file and load it into qgis for visualization. Note that the polygons are great for showing a bit of volume and color and that lines are more useful for borders:

pgsql2shp -f forests "select osm_id, landuse, "name", way from planet_osm_polygon where landuse='forest'"
pgsql2shp -f water "select name, way from planet_osm_polygon where \"natural\"='water'"
pgsql2shp -f waterwegen "select name, way from planet_osm_polygon where not waterway is null"
pgsql2shp -f railways "select name, way from planet_osm_line where railway = 'rail'"

These exported shape files can now be imported into QGis. Then adjust the properties per layer, specifying pen for drawing a black line around polygon areas or not (I didn't in this case). The lines only use a pen and don't have fill colors. Match wood / forest areas with green and water with blue, then select particular types for roads and railways and you're set!

Using the Print Composer function in QGis, you can now export what you've been creating. Make sure to use the Add Map button first, then export to either SVG (Inkscape?) or PNG.

Here's a page that explains the features in the database, but note that the features are not consistently used throughout the database, so you should always check your results:

http://wiki.openstreetmap.org/wiki/Map_Features

More detailed zoom-in near Amsterdam area using the same restricted set of layers and a comparison with Google Maps:


Notice how Google maps have painted areas in a more generalized way, whereas the openstreetmap image is still the original ESRI format. It should not be very difficult to start painting images of GIS data in a format like the Google one above, then overlay the original high-resolution data over those images to indicate the position of cafés, cinema's, etc...

Good luck!

Thursday, October 22, 2009

Reasoning with(in) language

Natural language processing is (eventually) very much related to understanding what is being written and (re-)recognising words in the particular semantic contexts that apply. After playing around with the NLTK for a while, I come to realize that the toolkit is much geared towards analysis of specific texts, or helps in defining a EBNF representation of a particular grammar such that a particular category of text can be parsed more successfully or specific parsers / analyzers can be researched. There isn't any mention I've seen where a generic classifier (NP/VP/DET) exists that understands what words are for, more or less like an incremental learner that just goes along and finds out what goes where. Discovering the requirements for such a process is the real question. What makes up language? Why can language appear so fluid and in so many forms and how come we recognize language so very quickly after the utterance, even though that particular utterance has likely not been spoken before? Or is there indeed a difference in the speed of interpretation of familiar utterances versus non-familiar utterances? That is a very interesting research question. For now, I'm thinking up whether there are ways to discover the semantic information of words automatically, or possibly let the computer express to us that something could not be recognized or didn't make sense, so that we could tell the computer how things actually worked.

One thing that I find is going to take up a lot of time is to explain to computers how the world works. The advantage we have is the number of senses, which tells us a lot more of the combinations of observations, how these join together (specifying the situation in greater detail) and the possible consequences that could ensue (whether it's danger or normal, etc). Our language also contains words that allow us to express the particular information about observations of the senses into great detail.

What is needed for computers to start learning is to define a generic model for the world, such that objects (instances of types or classes) can interact with other objects (thereby classes), so that by observation of a particular instance, the computer must be able to generalize that instance towards a class or a class higher up the tree, continuing to find consistencies or inconsistencies in that existence. This yields a number of questions that require the computer / agent to research them for truth or false. The result could be that the class that was initially constructed isn't entirely valid, or the instance belongs to a totally different class that was not known until that time.

Most models I've seen revolve around recognizing a stream of information that refers to elements in the world and a pre-designed reasoning model that the computer uses to use those observations appropriately. But that requires up-front design and then rules out the learning of computers more or less automatically. What designers mean with "learning machines" sometimes is not learning new concepts, but most of the times it is related to learning to recognize particular situations such that the corresponding actions (which are finitely defined in most computer programs) may be chosen.

Well, I have no idea about what this model should look like, but there is no reason why it could not look like a generic model setup for multi-agent worlds. That particular situation has relations, which describe possible relations that one entity/instance could have with another and in what way. Then there are functions, which could be thought of as actions or manipulators of instances, functions that do stuff to entities. Of course, these functions shouldn't directly issue a particular action, but only manipulate the object "in the mind's eye". So there's a clear distinction between observation and the observations made about objects in the real world and this real situation vs. the representation of those objects in working memory. Only when conclusions made in working memory make sense should the computer start a 'world action' by invoking some kind of control function.

The picture above was inspired by the action of 'pointing'. Looking at animals in the kingdom there are no animals that actually point to things to teach something to others (well, disregarding the pointer dogs :). So pointing is very specific to the human learning experience and probably the one that has allowed us to learn in the first place. Pointing is about attracting a person's attention towards something that is happen, since it is considered important for learning or for other people to start taking some kind of action. So it's about teaching, but also about group behaviour and a simple way of communication. It is quite remarkable that no other animal has developed this particular trait, being such a simple one.

The expectation for many programs is that they are complete in the sense that basic functionality should work once things are bootstrapped or "learned". With that I mean that the capabilities of the program are pre-determined and programmed in, it only needs to find out when to execute those actions, which is generally determined by inspecting a pre-determined stream of information and then either learning after which sequence of patterns it needs to invoke which action and so on.

It is quite interesting how we neglect the fact that we could teach the program things as well by telling it. And even more importantly, how little research there has been so far about computer programs determining that their knowledge is insufficient to solve the case at hand and for methods to complete that knowledge such that it can be executed in the future. That to me, means that we're still thinking of the computer program as a complete specification with prior requirements that is executing a particular job.

Saturday, October 17, 2009

Natural Language ToolKit

Due to my interest in the z-machine, I'm looking at natural language parsing and the complexities that entail it. There's actually a very nice python-based project that allows one to study natural language processing, it's called nltk. NLTK is more like a set of tools for getting frequency distributions, frequency plots, extracting information, processing raw text, etc. Basically, within a single line of text you can specify a lot of characteristics about the text or words you're interested in, then you run a function from words within another selected set and you get the results you're looking for (at least, that's the idea). There are a lot of Python functions and objects prebuilt into the toolkit offering a lot of generic tools that you'd typically use and to which you can feed specific sets of data for processing or parametrize with your specific intention. It's probably not that effective immediately for a particular application, but this is research and first you need to find out what to do before you start off working on some solution that you think might work. It's all about getting really deep into the matter very quickly, experimenting things, looking at results. I haven't worked extensively with Python, but neither Java nor C allows for this enormously compact syntax for querying and manipulating data sets. Because the number of functions are pretty large, it may be a bit daunting to find out what objects or functions there are, or what they do. Therefore, I suggest to install Eclipse or another IDE and run pydev within that for code completion purposes.

Get the latest distribution of NLTK from here. You can install NLTK on Ubuntu as follows:
$ sudo -s
# apt-get install python-numpy python-matplotlib prover9
# unzip nltk-2.0b3.zip
# cd nltk-2.0b3/
# sudo python setup.py install
# python
Python 2.6...... (......)
>>> import nltk
>>> nltk.download()
The following shows a cumulative frequency plot of the words that occurred most often first:
There are many other things that can be done. This is a textual example of collocations in the "Inaugural Address". Collocations are words that appear together frequently.
>>> text7.collocations()
Building collocations list
million *U*; New York; billion *U*; Wall Street; program trading; Mrs.
Yeargin; vice president; Stock Exchange; Big Board; Georgia Gulf;
chief executive; Dow Jones; S&P 500; says *T*-1; York Stock; last
year; Sea Containers; South Korea; American Express; San Francisco
These examples are taken from the book of NLTK. So, NLTK isn't really about making text accessible or some kind of engine that you can use for parsing / understanding text. It's a toolkit for language processing experimentation, so that using that knowledge you can roll your own stuff afterwards. A specific design goal is low-threshold access to the tools and functions of the toolkit. This should allow people without programming experience to use machine processing tasks for their own research. This hopefully puts it well within reach for anyone looking into natural language processing, filtering spam, building search engines, building translation engines and so forth.