In the end though, having read the word "matrices" somewhere in between, it all sounds a bit like singular value decomposition with a twist rather than something entirely new. I've been looking for ways to replicate their results somehow, and came up with the following:
- Using Hadoop as the underlying platform for simple tasks (counting words, ordering them, etc.) takes a lot of complexity out of the program. So I'm using Hadoop to count word frequencies, plus how often words are seen together and at what distance (see the first sketch after this list).
- Hadoop gives me a platform where I mostly work with files: intermediate results stored in files, and reducers to bring back some sanity. If I had to program all of that myself, I would be more concerned with stability and reliability than with the core of this little research, which is, in short, statistical analysis.
- The results are used to train a couple of SVDs à la Netflix, with gradient descent (see the second sketch after this list). Because I now have a good 1 GB file of frequencies, I don't need to process raw text anymore; it's all ready for learning. (The text processing took about 5 hours to run and collect the results. The catch is that the statistical significance only becomes apparent after the entire lot has been processed, unless you make heuristic estimates.)
- Each SVD is about 60 MB in size. With the memory left on my machine, I could theoretically run 33 SVDs working together.
- The pre-processed files allow me to process the entire "frequency" file line by line, one after the other. I post-processed it to take out any words I did not recognize.
- Since I know the statistical significance of each word, how it relates to other words and at what distance, I can just keep on training the features of each SVD. Not knowing that beforehand makes training slightly difficult (doh!).
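To make the first point concrete, here is a minimal sketch of the kind of co-occurrence counting job I mean: the mapper emits a (wordA, wordB, distance) key for every pair of words within a small window, and the reducer sums the counts. Tokenization, the window size and all class names are simplifications, not the actual code.

```java
// Sketch of a Hadoop job that counts how often two words co-occur at a given
// distance. Window size and tokenization are simplified; names are illustrative.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceCount {

    public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int WINDOW = 5;
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pairKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().toLowerCase().trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                // Emit (wordA, wordB, distance) -> 1 for every pair within the window.
                for (int d = 1; d <= WINDOW && i + d < tokens.length; d++) {
                    pairKey.set(tokens[i] + "\t" + tokens[i + d] + "\t" + d);
                    context.write(pairKey, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            // One output line per (wordA, wordB, distance) with its frequency.
            context.write(key, new IntWritable(total));
        }
    }
}
```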
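And for the Netflix-style training step, a sketch of the gradient descent update I have in mind, assuming each word gets a vector of latent features and the dot product of two word vectors is trained towards the observed co-occurrence score. Feature count, learning rate and regularization are placeholder values, not the ones actually used.

```java
// A minimal sketch of the Netflix-style gradient descent update: each word gets
// a feature vector, and the dot product of two word vectors is nudged towards
// the observed co-occurrence score read from the pre-processed frequency file.
public class FeatureTrainer {
    private static final int NUM_FEATURES = 32;      // placeholder
    private static final double LEARNING_RATE = 0.001;
    private static final double REGULARIZATION = 0.02;

    // features[wordId][f] -- one row of latent features per word.
    private final double[][] features;

    public FeatureTrainer(int vocabularySize) {
        features = new double[vocabularySize][NUM_FEATURES];
        java.util.Random random = new java.util.Random(42);
        for (double[] row : features) {
            for (int f = 0; f < NUM_FEATURES; f++) {
                row[f] = 0.1 * random.nextGaussian();
            }
        }
    }

    // One stochastic gradient step for a single (wordA, wordB, observedScore)
    // line from the frequency file.
    public void train(int wordA, int wordB, double observedScore) {
        double predicted = 0.0;
        for (int f = 0; f < NUM_FEATURES; f++) {
            predicted += features[wordA][f] * features[wordB][f];
        }
        double error = observedScore - predicted;
        for (int f = 0; f < NUM_FEATURES; f++) {
            double a = features[wordA][f];
            double b = features[wordB][f];
            features[wordA][f] += LEARNING_RATE * (error * b - REGULARIZATION * a);
            features[wordB][f] += LEARNING_RATE * (error * a - REGULARIZATION * b);
        }
    }
}
```

The point of the pre-processing is exactly that train() can be called line by line over the frequency file, without touching raw text again.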
In short, *can* the Turing test ever succeed? Because what if a computer OVER-performs, giving away its true nature? It seems that for a Turing test to succeed, the computer must be on the exact cognitive level of a human being, not higher or lower.
Anyway... Hadoop is really cool and useful. Having another computer in the network would halve processing times. Hadoop + custom programs + post-processing scripts sounds like a really useful way to pre-process huge amounts of text before you let a machine learn for days on end. Since the end product is a set of files in a more "edible" format, the research cycle should be a lot faster.
2 comments:
With Hadoop + gradient descent, how many MapReduce passes are you using for a single iteration? I've been trying to implement gradient descent in Hadoop, but my current method takes two MapReduce passes:
1) Map: calculate the change for each user-feature and item-feature pair
Reduce: calculate the sum of the changes
2) Map: merge the sums calculated above with the original user/item ratings
Reduce: apply the sums to each user/item rating by updating the appropriate feature, and use the output of this reduce as the input for the next iteration
This two-step approach slows down the iteration phase, and I was wondering if it could be done in one step?
I'm not using Hadoop for the gradient descent itself. Hadoop is great for high-volume data processing, but individual tasks can take some time to be allocated when running on a cluster.
For gradient descent in this particular problem, I didn't have numbers available up front about the specific distances between words. The source file was 5 GB or so, so storing individual results would have taken up a lot of space. Hadoop goes through pieces of the file(s) to calculate the word distances, and those results are then used in probability calculations (cogency, actually).
The problem is that for probability or cogency you need the total statistics of a word and its distances, and those are not available on the first run. So I dumped them into a separate file and then kept only the entries that correspond to valid words (roughly as in the sketch below).
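As a rough illustration of that second pass, assuming the per-word totals have already been collected into a map: once they are known, each (wordA, wordB, distance, count) line can be turned into a probability-like score. The "cogency" here is just count divided by the word's total, a stand-in for the actual weighting, not the real formula.

```java
// Sketch of the second pass: with per-word totals available, each pair count
// can be normalized into a probability-like score. Names and the scoring
// itself are illustrative only.
import java.util.Map;

public class CogencyPass {
    public static double cogency(Map<String, Long> wordTotals,
                                 String wordA, long pairCount) {
        Long total = wordTotals.get(wordA);
        if (total == null || total == 0) {
            return 0.0; // wordA was filtered out or never counted on its own
        }
        return (double) pairCount / total;
    }
}
```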
This problem sounds like the old Netflix one. Gradient descent is iterative, and you're processing one iteration in one map. If your idea is to average out the feature increments or decrements, why not do the iteration inside the map phase (see the sketch below)? In the final reduce phase you need to know how many times a feature was touched and how significant that is compared to the others, so you may have to calculate some statistics before you start, or store them within the file itself.
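To sketch what I mean by doing the iteration inside the map phase (with a single latent feature per user/item to keep it short): each mapper runs plain SGD over its own split of ratings and emits its local feature values together with how many updates touched them, and the reducer does a weighted average. This is only an illustration of the idea, not code from either of us.

```java
// One-pass version: SGD runs locally in each mapper, the reducer merges the
// per-split feature values weighted by how often each feature was touched.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class OnePassSgd {

    public static class SgdMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final double LEARNING_RATE = 0.005;
        private final Map<String, Double> feature = new HashMap<>();
        private final Map<String, Long> touches = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            // Expected input per line: userId \t itemId \t rating
            String[] parts = line.toString().split("\t");
            String user = "u:" + parts[0];
            String item = "i:" + parts[1];
            double rating = Double.parseDouble(parts[2]);

            double u = feature.getOrDefault(user, 0.1);
            double v = feature.getOrDefault(item, 0.1);
            double error = rating - u * v;

            // Local gradient step; nothing is emitted until the split is done.
            feature.put(user, u + LEARNING_RATE * error * v);
            feature.put(item, v + LEARNING_RATE * error * u);
            touches.merge(user, 1L, Long::sum);
            touches.merge(item, 1L, Long::sum);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the local feature value plus how often it was updated on this split.
            for (Map.Entry<String, Double> e : feature.entrySet()) {
                context.write(new Text(e.getKey()),
                              new Text(e.getValue() + "," + touches.get(e.getKey())));
            }
        }
    }

    public static class WeightedAverageReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double weightedSum = 0.0;
            long totalTouches = 0;
            for (Text value : values) {
                String[] parts = value.toString().split(",");
                long touchCount = Long.parseLong(parts[1]);
                weightedSum += Double.parseDouble(parts[0]) * touchCount;
                totalTouches += touchCount;
            }
            // Features touched more often count more in the merged value.
            context.write(key, new DoubleWritable(weightedSum / Math.max(1, totalTouches)));
        }
    }
}
```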
Having said that, I don't think Hadoop is a good fit for the calculations themselves. I'd use it to preprocess the data and get it ready for a heavy calculation stage, but not for running the calculations iteratively over Hadoop sessions.