Monday, August 17, 2009

Confabulation and how it relates to SVD?

I've been reading up on this confabulation theory. It's pretty interesting work, the theory, but the practical implementation isn't as nice as the ideas behind the theory :). The use of matrices in the implementation seems to basically make this a problem of statistical analysis. The difference with Bayesian theories and other probability frameworks is that this theory assumes a "causal" relationship between one and the other, not in the way that Bayes does, but through cogency maximalization.

In the end though, and reading the word "matrices" somewhere inbetween, it all sounds a bit like singular value decomposition with a twist, rather than something entirely new. I've been looking for ways to replicate their results somehow. I came up with the following:
  • Using Hadoop as an underlying platform for simple tasks (like counting words, ordering them, etc.), you take out a lot of complexity out of a program. So I'm using Hadoop to count word frequencies and the frequencies that words are seen together and at which distance.
  • Hadoop provides a platform where I'm mostly using files, intermediate results stored in files and reducers to bring back some sanity. If I'd had to program that myself, I would be more concerned about stability and reliability than the core of this little research, statistical analysis (in short).
  • The results are used to train a couple of SVD's รก la Netflix - gradient descent. Because I've got a good 1GB file of frequencies, I don't need to process text anymore, it's all ready for learning. (text processing required 5hrs to process and get the results together. The problem in learning is that the statistical significance is only apparent after the entire lot was processed, or you'd have to heuristically make estimates).
  • The SVD's are about 60M in size. On my machine with memory left, I could theoretically get to 33 SVD's, working together.
  • The pre-processed files allow me to process the entire "frequency" file line by line, one after the other. I post-processed it to take out any words I did not recognize.
  • Since I know the statistical significance of each word, how it relates to another and by what distance, I can just keep on training the features of each SVD. Not knowing that beforehand makes training slightly difficult (doh!).
I also thought about possible results of confabulation if it were implemented as a chat bot. Suppose that the technology were ready to produce sentences as if a chat session was going on. The Turing test is one important test, where judges need to make clear distinctions between humans and computers. Since this is a subjective test, I can't help but think that there's also an element of expectation involved. The judges know what a computer does, how it operates and what its limitations are. Now, if someone invents something really cool that breaks that barrier, the judges may temporarily be baffled and come to incorrect conclusions. But as soon as they learn about the research, they may adjust their expectations, raising the bar for the future.

In short, *can* the Turing test ever succeed? Because what if a computer OVER-performs, giving away its true nature? It seems that for a Turing test to succeed, the computer must be on the exact cognitive level of a human being, not higher or lower.

Anyway... Hadoop is really cool and useful. Having another computer in the network would half processing times. Hadoop + custom programs + post-processing scripts sound like really useful ways to pre-process huge amounts of text before you let a machine learn days on end. Since the end product are files that are in a more "edible" format, it should be a lot faster to do the research cycle.


cmunezero said...

with Hadoop + gradient descent, how many MapReduce's are you using for a single iteration? I've been trying to implement gradient descent in hadoop but my current method takes two MapReduce's:
1) Map: calculate the change for both user-feature and item-feature pair
Reduce: calculate sum of changes
2) Map: merge sums calculate above with original user/item ratings
Reduce: apply sums to each user/item rating by updating appropriate feature and use output of this reduce for next iteration

this two step approach slows down the iteration phase and was wondering if it could be done in one step?

Gerard Toonstra said...

I'm not using Hadoop for the gradient descent itself. Hadoop is great for volume data processing, but individual tasks can take some time to be allocated when used in a cluster.

For gradient descent in this particular problem, I didn't have numbers available about specific distances in words. The source file was 5 GB or so, so individual results would have taken up a lot of space. Hadoop goes through pieces of the file(s) to calculate the word distances, then those results are used in probability calculations (actually, cogency).

The problem was that for probability or cogency, you need the total statistics of a word and their distances and that is not available on the first run. So I dumped that into a separate file and then only took out the ones that make up valid words.

This problem sounds like the old netflix one. Gradient descent is iterative and you're processing one iteration in one map. If your idea is to average out the feature increments or decrements, why not do the iteration inside the map phase? In the final reduce phase, you need to know how many times a feature was touched and how significant that is in comparison with others. So you may have to calculate some statistics first before you start off, or store that within the file itself.

Having said that, I don't think Hadoop is a good solution for calculations. I'd consider ways of preprocessing data to get it ready for a heavy calculation stage, but not for doing the calculations iteratively over Hadoop sessions.