In the end, though, reading the word "matrices" somewhere in between, it all sounds a bit like singular value decomposition with a twist rather than something entirely new. I've been looking for ways to replicate their results, and I came up with the following:
- Using Hadoop as the underlying platform for simple tasks (like counting words, ordering them, etc.), you take a lot of complexity out of a program. So I'm using Hadoop to count word frequencies and how often words are seen together, and at what distance (a sketch of that counting job follows the list).
- Hadoop gives me a platform where I'm mostly working with files: intermediate results stored in files, and reducers to bring back some sanity. If I had to program that myself, I would be more concerned with stability and reliability than with the core of this little research, which is, in short, statistical analysis.
- The results are used to train a couple of SVDs à la the Netflix Prize, with gradient descent (see the training sketch after the list). Because I've got a good 1GB file of frequencies, I don't need to process text anymore; it's all ready for learning. (The text processing took about 5 hours to run and gather the results. The problem with learning straight from text is that the statistical significance only becomes apparent after the entire lot has been processed, or you'd have to estimate it heuristically.)
- Each SVD model is about 60MB in size. With the memory left on my machine, I could theoretically get to 33 SVDs working together.
- The pre-processed files allow me to run through the entire "frequency" file line by line, one record after the other. I post-processed it to take out any words I did not recognize (see the filter sketch below).
- Since I know the statistical significance of each word, how it relates to the others and at what distance, I can just keep training the features of each SVD. Not knowing that beforehand makes training slightly difficult (doh!).
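
To give an idea of what that counting job looks like, here's a minimal Hadoop Streaming sketch, assuming whitespace-tokenised, lower-cased text and a maximum distance of 5 words. The script name, window size and invocation are illustrative, not the exact code I ran:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming job counting word and co-occurrence frequencies.
# Hypothetical invocation (names illustrative):
#   hadoop jar hadoop-streaming.jar -input text/ -output counts/ \
#     -mapper "cooccur.py map" -reducer "cooccur.py reduce" -file cooccur.py
import sys

WINDOW = 5  # assumed maximum distance between two words

def mapper():
    for line in sys.stdin:
        words = line.lower().split()
        for i, w in enumerate(words):
            print("%s\t1" % w)                     # plain word frequency
            for d in range(1, WINDOW + 1):         # co-occurrence plus distance
                if i + d < len(words):
                    print("%s|%s|%d\t1" % (w, words[i + d], d))

def reducer():
    # Hadoop sorts mapper output by key, so equal keys arrive consecutively.
    current, total = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = key
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```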
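
The post-processing step that throws away unrecognised words is nothing more than a small filter script along these lines, assuming a plain list of known words (the file names are placeholders):

```python
#!/usr/bin/env python
# Drop any frequency line that mentions a word not in the known-word list.
import sys

with open("known_words.txt") as f:
    known = set(w.strip().lower() for w in f)

for line in sys.stdin:
    key = line.split("\t", 1)[0]
    # A pair key looks like "word|other|distance"; a unigram key is just "word".
    words = key.split("|")[:2] if "|" in key else [key]
    if all(w in known for w in words):
        sys.stdout.write(line)
```

Running it is just a matter of piping the Hadoop output files through it into one cleaned-up frequency file.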
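
The training itself is basically gradient descent on latent feature vectors, in the spirit of Simon Funk's Netflix SVD, but fed with co-occurrence counts instead of ratings. A bare-bones sketch, where the feature count, learning rate and the log-count target are assumptions of mine rather than tuned values:

```python
#!/usr/bin/env python
# Bare-bones gradient-descent "SVD" over the co-occurrence file.
# Feature count, learning rate, regularisation and the log-count target
# are illustrative assumptions, not tuned values.
import math
import random
import sys

FEATURES = 50     # size of each latent vector
LRATE = 0.002     # learning rate
REG = 0.02        # regularisation
EPOCHS = 10

vectors = {}      # word -> list of FEATURES floats

def vec(word):
    if word not in vectors:
        vectors[word] = [random.uniform(-0.01, 0.01) for _ in range(FEATURES)]
    return vectors[word]

def train(path):
    for _ in range(EPOCHS):
        with open(path) as f:
            for line in f:
                key, count = line.rstrip("\n").split("\t")
                if "|" not in key:
                    continue                       # skip plain unigram lines
                w1, w2, _dist = key.split("|")
                target = math.log(1 + int(count))  # value we try to reproduce
                u, v = vec(w1), vec(w2)
                pred = sum(a * b for a, b in zip(u, v))
                err = target - pred
                for k in range(FEATURES):          # standard SGD update
                    uk, vk = u[k], v[k]
                    u[k] += LRATE * (err * vk - REG * uk)
                    v[k] += LRATE * (err * uk - REG * vk)

if __name__ == "__main__":
    train(sys.argv[1])
```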
As an aside: *can* the Turing test ever succeed? What if a computer over-performs and gives away its true nature? It seems that for a Turing test to succeed, the computer must be at exactly the cognitive level of a human being, not higher or lower.
Anyway... Hadoop is really cool and useful. Having another computer in the network would halve the processing times. Hadoop + custom programs + post-processing scripts sounds like a really useful way to pre-process huge amounts of text before you let a machine learn for days on end. Since the end products are files in a more "edible" format, the research cycle should be a lot faster.