To help analyse my exploratory interviews on how parents choose schools for their children, I’m teaching myself how to implement text processing in the Python programming language. My objective is to apply Latent Semantic Analysis to the interviews to help understand how preferences are formed, how these preferences relate to one another, and whether there is any linguistic grouping of preference strength indicated by their semantic distance to choice groups.
However, I need to undertake some text pre-processing for these analyses to work. The pre-processing stages are:
- Remove punctuation
- Make all words lower case (so processed as same word irrespective of case)
- Remove stopwords (common words such as “in”, “at”, “the” etc.)
- Convert words to their root lemmas (e.g. “run”, “ran”, “runs”, “running” to “run”)
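The four stages above can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the code I actually run: the stopword list and the lemma lookup are tiny, made-up stand-ins, and the names `STOPWORDS`, `LEMMAS` and `preprocess` are assumptions chosen for the example.

```python
import string

# Hypothetical, deliberately tiny stopword list (a real one would be much longer).
STOPWORDS = {"in", "at", "the", "a", "an", "and", "of", "to", "for", "is"}

# Hypothetical lemma lookup covering only the "run" example from the list above.
LEMMAS = {"ran": "run", "runs": "run", "running": "run"}

# Translation table that deletes ASCII punctuation plus curly quotes.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation + "“”‘’")


def preprocess(text):
    """Return a list of cleaned tokens from a raw piece of interview text."""
    no_punctuation = text.translate(_STRIP_PUNCTUATION)  # 1. remove punctuation
    tokens = no_punctuation.lower().split()              # 2. lower case, split on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3. drop stopwords
    return [LEMMAS.get(t, t) for t in tokens]            # 4. map to lemma if one is listed


tokens = preprocess("The children ran to the school, and she runs at lunch!")
print(tokens)  # ['children', 'run', 'school', 'she', 'run', 'lunch']
```

Note that deleting punctuation outright also joins hyphenated words and contractions (“well-known” becomes “wellknown”), which may or may not be what you want.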
Lemma processing is important for latent semantic analyses of relatively small texts of 10,000 to 100,000 words, compared to the more usual many millions of words involved in web search algorithms. For texts running to many millions of words, Latent Semantic Analysis will do the equivalent of ‘lemmatising’ as part of its ‘similarity’ processing.
I tried to use the ‘standard’ lemma dictionaries but found that I needed to categorise words more specifically, such as grouping all the different sports together as ‘sport’, all the languages as ‘language’, all musical instruments as ‘music’ etc. There are also words that are similar in writing but not in meaning, and these will be stemmed in a way that is inappropriate. For example, ‘certain’ and ‘certainly’, and ‘actual’ and ‘actually’, are not similar in the context of decision making in school choice interviews, even though linguistically they do have the same roots. So in the end I created a dictionary that allowed me to categorise words in a way that was more meaningful and concise for the context of the analysis.
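As a rough sketch of the idea, a custom dictionary of this kind can be written as category groups and then inverted into a word-to-category lookup. The groups and words below are illustrative only and are not my actual dictionary; `CATEGORY_GROUPS`, `CUSTOM_LEMMAS` and `lemmatise` are hypothetical names.

```python
# Hypothetical category groups: each key is the label that replaces its member words.
CATEGORY_GROUPS = {
    "sport": ["football", "rugby", "netball", "swimming"],
    "language": ["french", "spanish", "german", "latin"],
    "music": ["piano", "violin", "guitar", "flute"],
    "run": ["ran", "runs", "running"],
}

# Invert the groups into a flat word -> category lookup.
CUSTOM_LEMMAS = {
    word: category
    for category, words in CATEGORY_GROUPS.items()
    for word in words
}


def lemmatise(tokens):
    """Replace each token with its category label; unlisted tokens pass through unchanged."""
    return [CUSTOM_LEMMAS.get(token, token) for token in tokens]


print(lemmatise(["she", "certainly", "plays", "rugby", "and", "violin"]))
# ['she', 'certainly', 'plays', 'sport', 'and', 'music']
```

Because only listed words are changed, pairs such as ‘certain’ and ‘certainly’ stay distinct unless they are deliberately grouped.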
A familiar everyday example of this kind of word matching is Google: type in “run” and the search will also pick up “ran”, “runs” and “running”.
An example of my custom dictionary for lemmatising school choice text is provided here.