Wednesday, November 11, 2009

Netflix prize revisited

The Netflix Prize has been over for some months now. I ended up 215th on the final leaderboard. Considering my involvement over time, that's probably where I should have ended up. Other teams had been at it for three years and rightfully ended up in the top 40.

If you still have a research interest in the Netflix Prize, the data can be downloaded from the UCI Machine Learning Repository. That set also includes the actual ratings that were used to score the submissions, as well as the winning submission set.

A very interesting implementation for working on this problem is PyFlix. It requires the downloaded data, but converts it into a binary database that is later mmap-ed into the process address space. That's quite clever: with mmap the operating system only pages in the parts of the file you actually touch, so you can map an index once, when you need it, without reading the whole thing into memory. You shouldn't set up such mappings inside loops, of course, but it goes to show that for performance computing, working directly with files is generally a much better approach than a database. Naturally, you have to put some work into the file layout before it does anything useful for you.
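As a rough illustration of the technique (a minimal sketch with an invented record layout and file name, not the actual PyFlix format): you can map a binary file of fixed-size records and pull out any record by offset, letting the OS page in only what you touch.

import mmap
import struct

# Hypothetical layout: fixed-size records of (movie_id: uint16,
# user_id: uint32, rating: uint8), packed little-endian without padding.
RECORD = struct.Struct("<HIB")

with open("ratings.bin", "rb") as f:
    # Map the whole file into the address space; nothing is read from
    # disk until a page is actually touched.
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def record(n):
        # Return the n-th (movie_id, user_id, rating) tuple without
        # loading the rest of the file.
        return RECORD.unpack_from(data, n * RECORD.size)

    print(record(0))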

If you look at the Basic Usage section of the PyFlix Trac page, you'll see some examples of the very nice interface that was built on top of the data. Researchers often don't have this much data to prove their concepts on, and an interface built on top of the Netflix data in this way is remarkably elegant for looking into similarities and trying out a couple of concepts directly from the Python interpreter.
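To give a flavour of that kind of interactive exploration (the data structures below are made up for illustration and are not the PyFlix API; in practice the rating vectors would come out of the PyFlix database), checking how similar two movies are is only a few lines at the interpreter:

from math import sqrt

# Toy rating vectors: user id -> rating, one dict per movie.
ratings_a = {1: 5, 2: 3, 7: 4, 9: 1}
ratings_b = {2: 4, 7: 5, 9: 2, 12: 3}

def cosine_similarity(a, b):
    # Cosine similarity over the users that rated both movies.
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

print(cosine_similarity(ratings_a, ratings_b))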

I have found, however, that a production implementation of SVD or any other algorithm isn't truly viable in pure Python, because of CPU constraints and overhead in a number of places. One illustration of what is left on the table is the following set of gcc flags for binaries built on my machine:
-Wall -march=core2 -mtune=core2 -O2
These tune the generated binaries specifically for the processor I am actually running on, instead of the generic "i386" architecture. I'm no expert on Python, so there may well be ways to optimize Python so that it runs much better. On my machine the flags above generate considerably more efficient code: a single cycle that takes 27 seconds when compiled for generic i386 comes down to 7 seconds in total.
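To make the cost concrete: the inner loop of an SVD-style matrix factorisation, in the spirit of the gradient-descent approach popularised during the prize, executes once per rating, which for this dataset means on the order of a hundred million times per training pass. The sketch below (my own illustrative names, dimensions and learning rates, not any particular team's code) shows how much work sits inside that loop when it is written in plain Python with numpy:

import numpy as np

N_USERS, N_MOVIES, N_FACTORS = 480189, 17770, 40
LRATE, REG = 0.001, 0.02   # illustrative learning rate and regularisation

# One small feature vector per user and per movie.
user_f = np.full((N_USERS, N_FACTORS), 0.1)
movie_f = np.full((N_MOVIES, N_FACTORS), 0.1)

def train_pass(ratings):
    # One stochastic-gradient pass over (user, movie, rating) triples;
    # this loop body runs once per rating in the training set.
    for user, movie, rating in ratings:
        err = rating - user_f[user] @ movie_f[movie]
        uf = user_f[user].copy()             # copy before updating in place
        user_f[user] += LRATE * (err * movie_f[movie] - REG * uf)
        movie_f[movie] += LRATE * (err * uf - REG * movie_f[movie])

train_pass([(0, 0, 5.0), (1, 2, 3.0)])       # tiny usage example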

Although programmers in general don't worry that much about memory, since they usually don't need to (readability of code and other quality attributes need attention as well), for loops that are executed a very large number of times (in the millions) it becomes much more important to focus on the actual performance of that loop. There is a very nice article about memory, "What Every Programmer Should Know About Memory", written by Ulrich Drepper, the maintainer of the glibc library on Linux. The glibc library is basically the glue between the kernel and your application, and it contains a number of handy low-level utility functions that every application uses (like strlen, strstr, etc.).

One of the important aspects of maintaining performance is trying to order your data (where that is functionally and technically possible) such that the processor doesn't incur a cache miss every time it needs the next element; loops over such data then run much quicker. Another kind of performance hog is the order in which you traverse your data, for example:
int my2DArray[ROWS][COLS];
When cycling through this array, you want to iterate over the rows in the outer loop and the columns in the inner loop. The elements of a single row are laid out linearly in memory, one after the other, so you'd typically iterate through the array like this:
for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
        sum += my2DArray[i][j];
Compare that to:
for (int j = 0; j < COLS; j++)
    for (int i = 0; i < ROWS; i++)
        sum += my2DArray[i][j];
The second version gets cache misses all over the place, which is very bad for performance. Ideally, for computational tasks, you find an algorithm that keeps the data it is working on in the processor cache for as long as possible, but of course only where that is feasible.
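The effect is easy to demonstrate even from Python. numpy arrays are row-major (C order) by default, so walking an array column by column strides through memory instead of reading it linearly. A quick, illustrative timing sketch (sizes and repeat counts picked arbitrarily):

import timeit
import numpy as np

ROWS, COLS = 2000, 2000
a = np.random.rand(ROWS, COLS)   # row-major (C order) by default

def by_rows():
    # Contiguous access: each a[i, :] is one linear block of memory.
    return sum(a[i, :].sum() for i in range(ROWS))

def by_cols():
    # Strided access: each a[:, j] touches one element per row,
    # COLS elements apart in memory.
    return sum(a[:, j].sum() for j in range(COLS))

print("row-wise:   ", timeit.timeit(by_rows, number=10))
print("column-wise:", timeit.timeit(by_cols, number=10))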

The pyflix implementation sneakily reads from disk and does quite a bit of work in the background on every iteration of the rating loop, which severely hurts performance in the long run. The good thing is that there's a very elegant API for accessing the data for other purposes, and this API includes a rather fast index. It's as if a tiny, purpose-built database engine was written to access the data, which is a remarkable and impressive feat in itself!
