Sunday, July 15, 2007

Latent Semantic Indexing

LSI (Latent Semantic Indexing) is a technique in computer science for finding certain "latent" information in documents. It analyzes semantic space using mathematics and statistics to discover semantic relationships between words and passages, although the computer cannot name the relationships it finds. Actual meaning cannot be derived this way either, but the technique can measure how one body of text relates to another.

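LSI creates a very, very large matrix with documents in columns and terms (words) in rows, where each cell holds the number of occurrences of that term in that document. The text to be compared is another single-column matrix, which is transposed and multiplied with this very large matrix. The result is a row of numbers, one per document, that describe relevance, or similarity. In full LSI the large matrix is first reduced with a singular value decomposition, so that the comparison happens in the semantic space and not just on word occurrence.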
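To make the transpose-and-multiply step concrete, here is a minimal sketch in plain Python. It only shows the raw term-count comparison: it leaves out the singular value decomposition that LSI normally applies to the matrix, so the scores reflect word occurrence only. The function names and the three toy documents are made up for illustration.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def score_query(query, terms, matrix):
    """Multiply the transposed query column by the matrix: one score per document."""
    q_counts = Counter(query.lower().split())
    q = [q_counts[t] for t in terms]  # query as a single column over the same terms
    n_docs = len(matrix[0]) if matrix else 0
    return [
        sum(q[i] * matrix[i][j] for i in range(len(terms)))
        for j in range(n_docs)
    ]


# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on monday",
]
terms, A = build_term_document_matrix(docs)
scores = score_query("cat mat", terms, A)
print(scores)  # one raw score per document
```

The first document matches both query words, the second matches one, and the third matches none, so the scores fall in that order.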

If you are interested, this tutorial gives a very good review of the technology. Several start-up companies are selling Search Engine Optimisation "solutions" based on LSI, but these are mostly a fraud:

LSI is an attempt to discover "latent" information in documents in order to make search engine searches more useful. Semantic search is about searching for meaning, whereas most current search engines use word occurrence search (a very dry method of search). LSI by itself is far from sufficient to even approximate a true semantic search.

I have just played around with this technology using a couple of papers found through Google. LSI Tutorial:

The technology is computationally very intensive, since matrix operations are expensive and the set we are considering, namely the Internet, is enormous. If you wanted to use LSI properly, you'd have to index all documents on the Internet first and establish a matrix (that will never fit in memory) with the number of columns equal to the number of documents you have analyzed and the number of rows equal to the number of unique terms (words) you have encountered. Then establish a single-column matrix from your search query that has as many rows as the other matrix. Then transpose and multiply. It's easy to see that this type of processing can't easily be done online for the volume of searches that are taking place.
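A quick back-of-the-envelope calculation shows the scale. The document and term counts below are purely hypothetical round numbers chosen for illustration, not measurements of the real web, and the sketch assumes a dense matrix with 4 bytes per cell. In practice most cells are zero and sparse storage would be smaller.

```python
def dense_matrix_bytes(n_documents, n_terms, bytes_per_cell=4):
    """Size in bytes of a dense matrix with terms in rows and documents in columns."""
    return n_documents * n_terms * bytes_per_cell


# Hypothetical, illustrative figures only.
n_documents = 10**9  # assumed number of indexed documents
n_terms = 10**6      # assumed number of unique terms

total_bytes = dense_matrix_bytes(n_documents, n_terms)
total_petabytes = total_bytes / 10**15

# A naive transpose-and-multiply touches every cell once per query.
multiply_adds_per_query = n_documents * n_terms

print(f"{total_petabytes:.0f} PB for the dense matrix")
print(f"{multiply_adds_per_query:.0e} multiply-adds per query")
```

Under these assumptions the matrix alone would take petabytes, and every single query would need on the order of 10^15 multiply-adds if done naively.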
