Saturday, October 17, 2009

Natural Language ToolKit

Because of my interest in the Z-machine, I've been looking at natural language parsing and the complexities it entails. There is a very nice Python-based project for studying natural language processing, called NLTK. NLTK is best thought of as a set of tools for computing frequency distributions, drawing frequency plots, extracting information, processing raw text, and so on. The idea is that in a single line of code you describe the characteristics of the text or words you're interested in, apply a function to the words of a selected text, and get the results you're looking for. The toolkit comes with a lot of prebuilt Python functions and objects: generic tools that you feed your own data sets or parametrize for your specific purpose.

It's probably not immediately effective for a particular application, but this is research: first you need to find out what to do before you start working on a solution that you think might work. It's all about getting deep into the matter quickly, experimenting, and looking at results. I haven't worked extensively with Python, but neither Java nor C offers such a compact syntax for querying and manipulating data sets. Because the number of functions is pretty large, it can be a bit daunting to find out which objects and functions exist and what they do. I therefore suggest installing Eclipse with PyDev, or another IDE, for code completion.
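To give a feel for the "one line per question" style, here is a minimal sketch in plain Python 3. It does not use NLTK itself: collections.Counter stands in for NLTK's FreqDist, and the sample sentence, the tokenize helper and the thresholds are made up for illustration.

```python
from collections import Counter
import re

# Hypothetical sample text; any string would do.
SAMPLE = (
    "The parser reads the command and the parser splits the command into words. "
    "Extraordinarily long words are rare, but short words are everywhere."
)

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize(SAMPLE)

# Frequency distribution: how often each word occurs.
freq = Counter(tokens)

# One-line query: distinct words longer than 7 letters, sorted alphabetically.
long_words = sorted(w for w in set(tokens) if len(w) > 7)

# One-line query: words that occur at least three times.
frequent = sorted(w for w, n in freq.items() if n >= 3)

print(freq.most_common(3))
print(long_words)
print(frequent)
```

Each query is a single comprehension over the token list, which is the kind of compactness the toolkit builds on.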

Get the latest distribution of NLTK from here. You can install NLTK on Ubuntu as follows:
$ sudo -s
# apt-get install python-numpy python-matplotlib prover9
# unzip nltk-2.0b3.zip
# cd nltk-2.0b3/
# python setup.py install
# python
Python 2.6...... (......)
>>> import nltk
>>> nltk.download()
From there you can, for example, draw a cumulative frequency plot of a text's words, most frequent first.
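The numbers behind such a plot are easy to compute by hand. The sketch below is plain Python 3 rather than NLTK (where, as far as I can tell from the book, a FreqDist is plotted with plot(50, cumulative=True)); the function name cumulative_counts and the toy sentence are my own.

```python
from collections import Counter
from itertools import accumulate

def cumulative_counts(tokens, n):
    """Return (word, cumulative count) pairs for the n most frequent words."""
    ranked = Counter(tokens).most_common(n)       # most frequent first
    words = [word for word, _ in ranked]
    running = accumulate(count for _, count in ranked)
    return list(zip(words, running))

# Hypothetical toy input.
tokens = "the cat and the dog and the bird".split()
curve = cumulative_counts(tokens, 2)

# Share of all tokens covered by the two most frequent words.
coverage = curve[-1][1] / len(tokens)

print(curve)
print(coverage)
```

Plotting the second element of each pair against the word rank gives the cumulative curve: it rises steeply at first because a handful of very common words account for a large share of the text.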
There are many other things that can be done. Here is a textual example: the collocations in the Wall Street Journal sample (text7, available after running "from nltk.book import *"). Collocations are sequences of words that occur together unusually often.
>>> text7.collocations()
Building collocations list
million *U*; New York; billion *U*; Wall Street; program trading; Mrs.
Yeargin; vice president; Stock Exchange; Big Board; Georgia Gulf;
chief executive; Dow Jones; S&P 500; says *T*-1; York Stock; last
year; Sea Containers; South Korea; American Express; San Francisco
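A rough idea of how such a list can be produced: count adjacent word pairs, skip pairs containing very common words, and keep the pairs that recur. This is only a sketch in plain Python 3 that ranks by raw frequency; NLTK's own collocation finder uses a more careful statistical ranking. The stopword list, function name and sample sentence are assumptions for illustration.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "on", "in", "and", "to"}

def frequent_bigrams(tokens, min_count=2):
    """Return adjacent word pairs seen at least min_count times, most frequent first."""
    pairs = zip(tokens, tokens[1:])               # adjacent word pairs
    counts = Counter(
        (a, b) for a, b in pairs
        if a not in STOPWORDS and b not in STOPWORDS
    )
    return [pair for pair, n in counts.most_common() if n >= min_count]

# Hypothetical sample text, already lowercased.
tokens = (
    "traders on wall street watched the stock exchange while "
    "wall street analysts said the stock exchange was calm"
).split()

print(frequent_bigrams(tokens))
```

Raw counts alone would favour pairs of merely frequent words on a real corpus, which is why a proper implementation compares how often a pair occurs with how often its words occur separately.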
These examples are taken from the NLTK book. NLTK isn't really about making text accessible, nor is it a ready-made engine for parsing or understanding text. It's a toolkit for experimenting with language processing, so that you can use what you learn to roll your own solution afterwards. A specific design goal is a low threshold for using its tools and functions, which should allow people without programming experience to apply machine processing to their own research. That hopefully puts it well within reach of anyone looking into natural language processing, spam filtering, search engines, translation engines and so forth.
