Compare Word Frequency Between Groups - WordStat Text Analytics
The WordStat text analytics software has many features for investigating trends in your text data. One recently added feature is the deviation table. This table highlights the words that each group of authors uses more or less frequently than the other groups.
For example, suppose you have a set of speeches written for different presidential candidates. You load these speeches into WordStat, assigning each speech to the appropriate candidate. You then run your analysis and bring up a deviation table. For each candidate, the table lists the words they use more frequently than the other candidates, as well as the words they use less frequently. Say one candidate has the word crime in their less frequently used list. This indicates they are focussing on crime less than the other candidates, who use that word more often. The same candidate has the word economy in their more frequently used list. This indicates they are focussing on the economy more than the other candidates, who use that word less often.
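To make the idea concrete, here is a minimal Python sketch of how a table like this could be computed. It is not WordStat's implementation: the statistic used (a group's relative frequency for a word minus the relative frequency across all groups), the function names, and the sample speeches are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def deviation_table(texts_by_group):
    """Return {group: {word: deviation}}.

    Assumed statistic: the word's share of the group's tokens minus
    its share of all tokens. Positive means the group uses the word
    more than average, negative means less.
    """
    counts = {
        group: Counter(w for text in texts for w in tokenize(text))
        for group, texts in texts_by_group.items()
    }
    overall = sum(counts.values(), Counter())
    overall_n = sum(overall.values()) or 1
    table = {}
    for group, c in counts.items():
        n = sum(c.values()) or 1  # avoid dividing by zero for empty groups
        table[group] = {w: c[w] / n - overall[w] / overall_n for w in overall}
    return table


def top_and_bottom(deviations, k=3):
    """Return the k most over-used and k most under-used words."""
    ranked = sorted(deviations.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], ranked[-k:]


# Hypothetical speeches, invented for this example.
speeches = {
    "Candidate A": ["economy jobs economy growth", "economy taxes"],
    "Candidate B": ["crime safety crime police", "crime economy"],
}

table = deviation_table(speeches)
more, less = top_and_bottom(table["Candidate A"], k=1)
print("Used more by Candidate A:", more)
print("Used less by Candidate A:", less)
```

With this invented data, economy lands in Candidate A's more frequent list and crime in their less frequent list, mirroring the scenario described above.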
Worked Example - Casey Next
Casey is a council municipality in the south-east of Melbourne, the capital city of the state of Victoria, in Australia. In 2016 the City of Casey ran a short survey called Casey Next, aimed at engaging the community and identifying areas for improvement. This example takes a quick look at the results of that survey. The WordStat project files are linked in the zip file below, along with the original survey data.
Multiple questions were asked, but for this example we are going to look at the answers to the question "If you had the power to change just one thing in the City of Casey what would it be?" Once my data is loaded into WordStat, I click the Analyze button at the top of the WordStat screen (button shown below).
I select the text variable CHANGE_1 in the left select list, and I select all the group variables in the right select list. I then click the RUN EXPERT MODE button. The selections box is shown below. If you are analyzing the data for the first time and you do not already have a categorization dictionary, you will be asked to select an exclusion list. Select English, as this is the language used in the survey.
WordStat will open on the Text Processing tab. Select the Crosstab analysis tab from the menu list across the top of the program.
Once you are in the Crosstab tab you can access the deviation table. To open a deviation table, first select the group variable you are interested in from the With: drop-down list in the top left of the screen. I am going to select the WARD variable to start.
Then to load the deviation table, click the +/- button, located next to a set of other buttons, across from the Statistic: drop-down menu. The button is shown below:
Now we can see the deviation table.
This table shows that residents of the Edrington ward mentioned improving public transport and traffic congestion more frequently than residents of other wards. The Edrington residents also talked less about Cranbourne than residents of other wards. Cranbourne is a suburb of Melbourne that is largely within Casey's jurisdiction. The ward boundaries have changed since this survey was taken, but we could suppose that before the boundary change Edrington did not cover any of the Cranbourne suburb. Its residents would therefore have had less reason to talk about it.
Looking at the rest of the table, residents of the Four Oaks ward mentioned community more than residents of other wards. If we take the survey question into account, we could infer that these residents are most interested in an improved community experience. The residents of the Mayfield ward mentioned crime and infrastructure (trains, schools, parking) more frequently, which suggests these residents are most concerned with having less crime and better infrastructure. The residents of the River Gum ward mentioned people and parks more frequently, and transport, housing and congestion less frequently. This suggests those living in River Gum are more satisfied with traffic congestion and public transport than those in other wards.
The residents of the Springfield ward mentioned rates change more frequently, suggesting they believe their rates are too high. From this you might infer that the residents of Springfield are more affluent than those in other wards, given this is their main concern, although the survey alone cannot confirm it. Those who listed Visitor as their ward do not live in the City of Casey. Visitors mentioned free community transport more than any group of residents. Finally, residents of the Balla Balla ward mentioned housing more frequently than residents of other wards, and were less concerned about improving community and parking.
The suburb of Cranbourne was mentioned more frequently by residents of Mayfield and Balla Balla, and less frequently by residents of Edrington, River Gum, and Springfield. From this you could suppose that, at the time of the survey, the suburb of Cranbourne lay mostly within the boundaries of the Mayfield and Balla Balla wards.
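The ward breakdown above amounts to cross-tabulating word counts against one grouping variable, and the same answers can be cross-tabulated against any other group variable in the project. The following Python sketch illustrates that idea only. The field names CHANGE_1, WARD and GENDER come from the survey, but the crosstab helper, the sample answers and the gender labels are invented for illustration and do not reflect how WordStat works internally.

```python
from collections import Counter, defaultdict
import re


def crosstab(records, text_field, group_field):
    """Count how often each word appears within each group."""
    table = defaultdict(Counter)
    for rec in records:
        words = re.findall(r"[a-z']+", rec[text_field].lower())
        table[rec[group_field]].update(words)
    return table


# Hypothetical survey rows, invented for this example.
records = [
    {"CHANGE_1": "more parking", "WARD": "Edrington", "GENDER": "Female"},
    {"CHANGE_1": "better public transport", "WARD": "Edrington", "GENDER": "Male"},
    {"CHANGE_1": "more parking near schools", "WARD": "Mayfield", "GENDER": "Female"},
]

# Same text variable, two different grouping variables.
by_ward = crosstab(records, "CHANGE_1", "WARD")
by_gender = crosstab(records, "CHANGE_1", "GENDER")

print(by_ward["Edrington"]["parking"])
print(by_gender["Female"]["parking"])
```

Changing only the group_field argument plays the same role here as changing the selection in the With: drop-down list.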
Now let's see what the breakdown is by GENDER. To do this I close the deviation table, change the With: drop-down menu from WARD to GENDER, and click the deviation table button again. This gives me the following deviation table by gender:
From this gender breakdown, we can see that females mentioned parking more frequently, suggesting they would like more places to park within the City of Casey. Parking was mentioned much less by males. There are a number of reasons why this might be the case. It is possible that more females work within the City of Casey while more males work outside it, in which case parking within Casey would be less of a concern for males. Perhaps more males than females use public transport and so do not need parking at all. It could also be that more females are responsible for shopping and for dropping off and picking up children at school or other activities, meaning they rely more on parking throughout their day and therefore pay more attention to its availability. Further investigation would be needed to clarify why parking is much more of an issue for females in this survey.
Females mentioned public transport and roads much less frequently, while males mentioned them much more frequently. This suggests that males may use public transport more than females. Those who did not specify a gender (Prefer not to say) were most concerned about park facilities. The word change should probably be added to the exclusion list, since it is part of the survey question and so tells us little about what respondents want.
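The effect of an exclusion list can be sketched in a few lines of Python. This is only an illustration of the general idea of filtering out unwanted words before counting. The function name, the short exclusion list and the sample answer are assumptions, and WordStat's English exclusion list is far longer than the one shown here.

```python
import re


def remove_excluded(text, exclusion_list):
    """Tokenize the text and drop any word found in the exclusion list."""
    excluded = {w.lower() for w in exclusion_list}
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in excluded]


# Hypothetical exclusion list: a few common words plus "change",
# which echoes the wording of the survey question.
exclusions = {"the", "to", "i", "would", "change"}

answer = "I would change the parking"  # invented sample answer
kept = remove_excluded(answer, exclusions)
print(kept)
```

Once change is excluded, only the word that carries the respondent's actual concern (parking) remains to be counted.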
Deviation tables are useful for identifying differences between groups. They are especially useful for survey data such as this, where they give a good indication of how to focus council efforts across wards and for different demographics.