Using linguistic analysis to understand how parents choose schools for their children

In economics there is limited use of linguistic analysis to understand decision-making processes and the contextual relationships between preferences. Over the last six months I have undertaken field research into how parents choose a school for their children and the decision architecture associated with this choice. The objective was not simply to collect information about stated preferences, but to understand the complexity of the decision process. I conducted 22 exploratory interviews with Melbourne and regional Victorian parents – with a reasonable level of diversity in family demographics – looking at how they approach the problem of choosing a school for their children.

The principal purpose of these interviews was to explore for interesting economic ideas and questions arising from field observations. The intent was not to achieve a statistically robust collection of narrowly scoped interviews, but to identify opportunities that would warrant targeted econometric, experimental or theoretical research in the later part of my PhD. The presentation I gave at the 2014 ‘Cooperation and conflict in the family’ conference on an intergenerational discount heuristic is one of the ideas that arose from these field observations and interviews.


Python Project – linguistic analysis of interviews

To help analyse my exploratory interviews on how parents choose schools for their children, I’m teaching myself text processing in the Python programming language. My objective is to apply Latent Semantic Analysis to the interviews to understand how preferences are formed, the relationships between those preferences, and whether there is any linguistic grouping of preference strength indicated by semantic distance to choice groups.

However, some text pre-processing is needed before these analyses will work. The pre-processing stages are as follows (a short Python sketch follows the list):

  1. Remove punctuation
  2. Make all words lower case (so processed as same word irrespective of case)
  3. Remove stopwords (common words such as “in”, “at”, “the”, etc.)
  4. Convert words to their root lemmas (e.g. “run”, “ran”, “runs” and “running” all become “run”)
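
As a concrete illustration, here is a minimal sketch of these four stages using the NLTK library, assuming the interview transcripts are available as plain-text Python strings. The function and variable names, and the example sentence, are mine rather than taken from the actual analysis code.

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off downloads of the NLTK data these stages rely on.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatiser = WordNetLemmatizer()

def lemmatise(token):
    # WordNet needs a part-of-speech hint, so try the verb lemma
    # first ("running" -> "run"), then the noun lemma
    # ("children" -> "child").
    verb = lemmatiser.lemmatize(token, pos="v")
    return lemmatiser.lemmatize(verb, pos="n")

def preprocess(text):
    """Return a list of lower-cased, lemmatised content words."""
    # 1. Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Make all words lower case.
    tokens = nltk.word_tokenize(text.lower())
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Convert words to their root lemmas.
    return [lemmatise(t) for t in tokens]

print(preprocess("The children were running to their schools."))
# -> ['child', 'run', 'school']
```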

Lemma processing is important for latent semantic analyses of relatively small corpora of 10,000 to 100,000 words, compared with the many millions of words typically involved in web search algorithms. At those larger sizes, Latent Semantic Analysis will do the equivalent of ‘lemmatising’ as part of its ‘similarity’ processing.

I tried to use ‘standard’ lemma dictionaries but found that I needed to categorise words more specifically, such as grouping all the different sports together as ‘sport’, all the languages as ‘language’, all the musical instruments as ‘music’, and so on. There are also words that are similar in form but not in meaning, and which a stemmer will merge inappropriately. For example, ‘certain’ and ‘certainly’, or ‘actual’ and ‘actually’, are not similar in the context of interviews about school choice decisions, even though linguistically they share the same roots. So in the end I created a dictionary that allowed me to categorise words in a way that was more meaningful and concise for the context of the analysis.
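
To make this concrete, here is an illustrative fragment of such a custom dictionary. The entries below are examples of the kinds of groupings described above, not the actual dictionary used in the analysis (that is linked at the end of this post).

```python
# Illustrative fragment of a custom lemma dictionary; entries are
# examples only, not the dictionary actually used in the analysis.
CUSTOM_LEMMAS = {
    # Group specific activities into broader preference categories.
    "football": "sport", "netball": "sport", "swimming": "sport",
    "french": "language", "japanese": "language", "mandarin": "language",
    "piano": "music", "violin": "music", "clarinet": "music",
    # Keep apart words a standard stemmer would wrongly merge:
    # "certainly" and "actually" mark confidence in a statement,
    # not the qualities "certain" and "actual".
    "certainly": "certainly",
    "actually": "actually",
}

def custom_lemmatise(token, fallback=lambda t: t):
    """Look the token up in the custom dictionary first, deferring to
    a standard lemmatiser (or the token itself) when it is absent."""
    return CUSTOM_LEMMAS.get(token, fallback(token))

print(custom_lemmatise("netball"))    # sport
print(custom_lemmatise("certainly"))  # certainly (not 'certain')
```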

The best everyday example of latent semantic analysis in action is Google search: type in “run” and the search will also pick up “ran”, “runs” and “running”.
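
For anyone curious how this looks in code, below is a toy sketch of that behaviour using the gensim library (one common Python implementation of Latent Semantic Analysis). The three ‘documents’ are made up for illustration and are assumed to have already been through the pre-processing stages above.

```python
from gensim import corpora, models, similarities

# Toy corpus of already pre-processed (tokenised, lemmatised) documents.
docs = [
    ["child", "run", "school", "sport"],
    ["parent", "choose", "school", "music"],
    ["child", "sport", "swim", "run"],
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# Fit a two-topic LSI (latent semantic indexing) model.
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

# Build a similarity index over the documents in the reduced LSI space.
index = similarities.MatrixSimilarity(lsi[bow_corpus])

# Query with "run": documents that share latent topics with the query
# score highly even when they don't contain the query word itself.
query_bow = dictionary.doc2bow(["run"])
print(list(enumerate(index[lsi[query_bow]])))
```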

An example of my custom dictionary for lemmatising school choice text is provided here.