Monday, July 16, 2007

Latent Semantic Analysis

This is a wonderful explanation of LSA:

http://lsa.colorado.edu/whatis.html

"As a practical method for the statistical characterization of word usage, we know that LSA produces measures of word-word, word-passage and passage-passage relations that are reasonably well correlated with several human cognitive phenomena involving association or semantic similarity. Empirical evidence of this will be reviewed shortly. The correlation must be the result of the way peoples' representation of meaning is reflected in the word choice of writers, and/or vice-versa, that peoples' representations of meaning reflect the statistics of what they have read and heard. LSA allows us to approximate human judgments of overall meaning similarity, estimates of which often figure prominently in research on discourse processing. It is important to note from the start, however, that the similarity estimates derived by LSA are not simple contiguity frequencies or co-occurrence contingencies, but depend on a deeper statistical analysis (thus the term "Latent Semantic"), that is capable of correctly inferring relations beyond first order co-occurrence and, as a consequence, is often a very much better predictor of human meaning-based judgments and performance.

Of course, LSA, as currently practiced, induces its representations of the meaning of words and passages from analysis of text alone. None of its knowledge comes directly from perceptual information about the physical world, from instinct, or from experiential intercourse with bodily functions and feelings. Thus its representation of reality is bound to be somewhat sterile and bloodless."

Having read this from the perspective of inferring meaning from a corpus of text, I think the perspectives and statements on the use of LSA or LSI are too optimistic: by itself, I doubt the technique will ever become truly useful for web search.
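The "deeper statistical analysis" the quote mentions is, in practice, a truncated singular value decomposition (SVD) of a term-passage matrix. Below is a minimal sketch in Python; the toy corpus, the variable names, and the choice of k are mine, purely for illustration:

import numpy as np

# Toy passages: "car" and "automobile" never co-occur in the same passage,
# yet they share context words, so LSA should relate them (a relation
# beyond first-order co-occurrence).
passages = [
    "the car is driven on the road",
    "the automobile is driven on the road",
    "a truck is driven on the highway",
    "the poem speaks of love and loss",
    "a love poem about loss",
]

# Term-passage count matrix A: rows are terms, columns are passages.
vocab = sorted({w for p in passages for w in p.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(passages)))
for j, p in enumerate(passages):
    for w in p.split():
        A[index[w], j] += 1

# Truncated SVD: keep only k latent dimensions. Discarding the small
# singular values is the "deeper statistical analysis": it forces words
# with similar contexts onto nearby vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]        # latent word vectors (rows of U_k S_k)
passage_vecs = Vt[:k].T * s[:k]     # latent passage vectors (rows of V_k S_k)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# High similarity despite zero co-occurrence; low similarity across topics.
print(cosine(term_vecs[index["car"]], term_vecs[index["automobile"]]))
print(cosine(term_vecs[index["car"]], term_vecs[index["poem"]]))

The point is that the closeness of "car" and "automobile" is recovered from shared context alone, which is exactly what the quote means by relations beyond first-order co-occurrence.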

A philosophical discussion on the meaning of meaning can be useful for understanding how meaning is actually represented and how it can be analyzed. If we ever understand how meaning is derived, it should be possible to build better approximate (mathematical?) models.

It's very difficult to infer any kind of meaning without access to the real world the way humans have it. It would be interesting to find out what the world is like for deaf or blind people; this should give us useful clues about the way a computer perceives a corpus of text. Moreover, the ways disabled people compensate might point to analogous compensations for LSA or LSI.

It is very interesting, though, to see how meaning and semantics can be represented (in limited ways) by a mathematical calculation. This raises the question of whether the mind itself is a very large, quick, and efficient calculator, or whether it depends on certain natural processes. Personally, as I argued in another post, I think the mind does not rely on calculation alone, and that the model of a stack-based computer does not even come close to resembling our "internal CPU".

The intricate and complex process of deriving meaning from the environment requires an interaction between memory, interpretation, analysis and emotion. Mapping this to a computer:
  • Memory == RAM and disk, probably very, very large and not always accurately represented (human memory is 'fuzzy')
  • Analysis == Deconstruction of events into smaller parts
  • Interpretation == The idea inferred from the sum of the smaller parts, with extra information added from memory (similar cases)
  • Emotion == A lookup and induction of feelings based on the sum of the smaller parts, recalling the emotions associated with (the sum of) those events. Think of the feelings induced by watching or reading a romantic love story, or of the stress induced by a previously suffered trauma.
Clearly, the computer is missing a lot of information. Besides the problems of Natural Language Processing (variations of meaning "hidden" in the text, words that mean different things in different contexts, etc.), a poem is, to a computer, a sterile corpus of text that embodies much less meaning than it does to a human. Without memory, and therefore without association with similar events, a single corpus of text is empty and out of context.

These realizations lead me to believe that, in order for semantic search to be really successful, one must replicate people's memories, emotions, and contexts, and analyze each corpus of text (the Internet) within the context of that particular person. Analyzing the whole Internet within the context of each individual is an impossible task, but if we did it based on certain profiles instead, it might become feasible.
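Purely as a thought experiment (this is my own construction, not an established method): in the LSA setting, a "profile" could be crudely approximated by the average latent vector of the passages a person has read, and retrieval scores could blend plain relevance with profile affinity. Continuing the toy sketch above:

# Hypothetical profile-based scoring, reusing passage_vecs and cosine from
# the earlier sketch. alpha trades pure relevance against profile fit.
def personalized_scores(intent_vec, read_history, alpha=0.7):
    profile = passage_vecs[read_history].mean(axis=0)  # crude stand-in for a person's context
    return [alpha * cosine(intent_vec, d) + (1 - alpha) * cosine(profile, d)
            for d in passage_vecs]

# A reader of poetry (passages 3 and 4) gets the poem passages boosted
# relative to a purely relevance-based ranking.
print(personalized_scores(passage_vecs[0], read_history=[3, 4]))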

Ideally, we would store the "meaning" of a corpus of text rather than just its keywords, and only later match that stored meaning with intention (a search). I don't think we can yet represent meaning in any form other than text itself, unless we consider the large arrays of numbers (matrices) produced by LSA or LSI to be representations of meaning.
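If meaning really can be stored as large arrays of numbers, then matching stored meaning with intention amounts to folding a free-text query into the same latent space and ranking the stored passage vectors by similarity. A last sketch, reusing the toy setup from the first example (projecting the query's term vector onto the latent axes, q mapped to U_k^T q, makes it directly comparable with the passage vectors above):

# Fold a free-text "intention" into the latent space and rank the stored
# passages. Reuses vocab, index, U, k, passages, passage_vecs, and cosine.
def fold_in(text):
    q = np.zeros(len(vocab))
    for w in text.split():
        if w in index:              # out-of-vocabulary words are simply dropped
            q[index[w]] += 1
    return q @ U[:, :k]             # project the term vector onto the latent axes

query = fold_in("automobile on the highway")
for j in sorted(range(len(passages)), key=lambda j: -cosine(query, passage_vecs[j])):
    print(round(cosine(query, passage_vecs[j]), 3), passages[j])

The driving-related passages should come out on top even though the stored texts never contain the exact query phrase, which is the sense in which the matrices "store meaning" rather than keywords.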

Ugh! Sounds like LSD might be a better means to approximate meaning :)
