Stata Regular Expressions - An Introduction
Regular expressions are a form of computational pattern-matching that allow you to extract specific information from a string variable. The string variable will need to have some structure, although even a very broad structure should be enough to extract some information.
To create a regular expression we use a set of characters that have special meanings to build a generic structure. Once this structure is built Stata can use it to search a string variable and extract parts of that variable. The special set of characters and how to use them are shown in the table below:
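As a rough illustration of a few of these special characters (only the ones used later in this article, and not a substitute for the full table), the sketch below passes made-up test strings to Stata's regexm() function, which returns 1 when the pattern matches and 0 when it does not.

```stata
* The ^ anchor pins the match to the start; [0-9] matches one digit. Expect 1.
display regexm("2000 NSW", "^[0-9]")

* The $ anchor pins the match to the end; [A-Z] matches one capital. Expect 1.
display regexm("2000 NSW", "[A-Z]$")

* The + means one or more of the preceding item; the letter a breaks the run. Expect 0.
display regexm("12a", "^[0-9]+$")

* The ? means zero or one of the preceding item, so two letters still match. Expect 1.
display regexm("WA", "^[A-Z][A-Z][A-Z]?$")

* The | means "or"; parentheses group the alternatives. Expect 1.
display regexm("dog", "^(cat|dog)$")
```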
How to Use:
Building the Regular Expression:
For this example I will build a regular expression using the simple address structure of 1 Street Name Suburb 2000 STATE. The general structure here is a number, followed by one or more words, followed by a 4-digit number, followed by a 2 or 3 character state code.
**NOTE** I will use the underscore _ to represent spaces so they are easy to see. If you would like to follow along please remember to replace the underscores with normal spaces before using this regular expression.
Let's start with the first part of our address, which is the street number. As I know it is always going to be a number, I use the range zero to nine in square brackets – [0-9]. This matches any single digit. I also know this number will be at the absolute start of the string, so I add the ^ start anchor before the square brackets – ^[0-9]. For this address string I know that there will be at least one digit, but there could be more than one. The + symbol matches one or more of the preceding expression, so I add the + to my expression – ^[0-9]+. Finally, I know that the number will be directly followed by a space, so I add a single space after the + symbol. So my regular expression now looks like this: ^[0-9]+_
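A quick way to check the expression at this stage is to try it on a couple of strings with regexm(). The addresses below are made up for illustration, and the pattern is typed with a real space rather than the underscore placeholder.

```stata
* Starts with digits followed by a space. Expect 1.
display regexm("1 Smith Street Perth 6000 WA", "^[0-9]+ ")

* Starts with a word, so the ^ anchor prevents a match. Expect 0.
display regexm("Unit 1 Smith Street Perth 6000 WA", "^[0-9]+ ")
```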
Now we look at the next part of the string. There is a set of words which will have both upper and lower case letters, and may contain spaces, apostrophes and hyphens. This means I need both upper case and lower case letter ranges within my next set of brackets – [A-Za-z]. I can also add a space, a hyphen and an apostrophe within the square brackets so that these will be included in the search – [A-Za-z_'-]. Be aware that the hyphen is also used to indicate a range, so I place it at the end, next to the closing square bracket, where it is read as a literal hyphen rather than as a range. I know that my address will contain at least one word, so I again use the + symbol to indicate there will be one or more of the preceding characters – [A-Za-z_'-]+. After the set of words I add another space. My regular expression now looks like this: ^[0-9]+_[A-Za-z_'-]+_
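The sketch below tries the expression so far on made-up addresses, to confirm that the character set accepts apostrophes and hyphens. Real spaces are used in place of the underscores.

```stata
* Street name containing an apostrophe. Expect 1.
display regexm("22 O'Connell Street Sydney 2000 NSW", "^[0-9]+ [A-Za-z '-]+ ")

* Street name containing a hyphen. Expect 1.
display regexm("7 Grey-Smith Road Hobart 7000 TAS", "^[0-9]+ [A-Za-z '-]+ ")

* No words after the street number, so the word section cannot match. Expect 0.
display regexm("22 2000 NSW", "^[0-9]+ [A-Za-z '-]+ ")
```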
The next part of the string, after the street name and suburb, is the postcode. All postcodes in Australia are exactly four digits long. Since each set of square brackets matches exactly one character, the postcode can be represented by four sets of square brackets with zero to nine ranges in them. This is represented as – [0-9][0-9][0-9][0-9]. The postcode is also followed by a space. My regular expression now looks like this: ^[0-9]+_[A-Za-z_'-]+_[0-9][0-9][0-9][0-9]_
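To see that four bracket sets really do require exactly four digits here, the expression so far can be tried on made-up addresses with a valid and an invalid postcode. Real spaces replace the underscores.

```stata
* Four-digit postcode followed by a space. Expect 1.
display regexm("1 Smith Street Perth 6000 WA", "^[0-9]+ [A-Za-z '-]+ [0-9][0-9][0-9][0-9] ")

* Three-digit postcode: the fourth [0-9] has nothing to match. Expect 0.
display regexm("1 Smith Street Perth 600 WA", "^[0-9]+ [A-Za-z '-]+ [0-9][0-9][0-9][0-9] ")

* Five-digit postcode: the space after the fourth digit is missing. Expect 0.
display regexm("1 Smith Street Perth 60000 WA", "^[0-9]+ [A-Za-z '-]+ [0-9][0-9][0-9][0-9] ")
```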
Finally, the last part of our address is the state code. This section is a little trickier than the postcode, because in Australia some state codes have only two characters and some state codes have three characters. Because I know there will be at least two characters, I can represent these first two characters simply – [A-Z][A-Z]. There are two options for how to deal with the third character. You can use the pipe | to indicate the logical "or" and then search for either two or three characters. In this case I will instead use the question mark ? symbol as this is a simpler way to denote what I am searching for. From the description in our table above, we know that the ? symbol means that there will be either zero or exactly one of the preceding character set found in the string. If I add this to a third A-Z set then we are saying that sometimes there will be a third character and sometimes there won't be. This makes our state code regular expression – [A-Z][A-Z][A-Z]?. Finally, we know the state code is at the absolute end of our string. Nothing else will come after the state code. This means we add the $ anchor to indicate the end of the string – [A-Z][A-Z][A-Z]?$. Our full regular expression is as follows: ^[0-9]+_[A-Za-z_'-]+_[0-9][0-9][0-9][0-9]_[A-Z][A-Z][A-Z]?$
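The full expression can be checked against both two-letter and three-letter state codes, and against a string with extra text after the state code. The addresses are made up for illustration and real spaces replace the underscores.

```stata
* Two-letter state code at the end of the string. Expect 1.
display regexm("1 Smith Street Perth 6000 WA", "^[0-9]+ [A-Za-z '-]+ [0-9][0-9][0-9][0-9] [A-Z][A-Z][A-Z]?$")

* Three-letter state code at the end of the string. Expect 1.
display regexm("22 O'Connell Street Sydney 2000 NSW", "^[0-9]+ [A-Za-z '-]+ [0-9][0-9][0-9][0-9] [A-Z][A-Z][A-Z]?$")

* Text after the state code, so the $ anchor prevents a match. Expect 0.
display regexm("1 Smith Street Perth 6000 WA Australia", "^[0-9]+ [A-Za-z '-]+ [0-9][0-9][0-9][0-9] [A-Z][A-Z][A-Z]?$")
```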
In order to use this in Stata to extract the different parts of the string, we need to add parentheses () around the sections we want to extract. To extract the street number we use – (^[0-9]+). To extract the street name and suburb we use – ([A-Za-z_'-]+). To extract the postcode we use – ([0-9][0-9][0-9][0-9]). To extract the state code we use – ([A-Z][A-Z][A-Z]?$). The spaces between sections stay outside the parentheses so they are not included in the extracted text. So our final regular expression with our chosen sections looks as follows: (^[0-9]+)_([A-Za-z_'-]+)_([0-9][0-9][0-9][0-9])_([A-Z][A-Z][A-Z]?$)
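Adding parentheses marks sections for extraction but should not change which strings match. The short check below, on a made-up address and with real spaces in place of the underscores, compares the expression with and without parentheses.

```stata
* Expression without parentheses. Expect 1.
display regexm("305 Old Coast Road Hobart 7000 TAS", "^[0-9]+ [A-Za-z '-]+ [0-9][0-9][0-9][0-9] [A-Z][A-Z][A-Z]?$")

* Same expression with the four sections wrapped in parentheses. Expect 1.
display regexm("305 Old Coast Road Hobart 7000 TAS", "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")
```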
Each () section is assigned a number from 1 to 9 in order from left to right, and 0 refers to the entire matched string. So for our expression above – 1 is used to extract the street number, 2 is used to extract the street name and suburb, 3 is used to extract the postcode, and 4 is used to extract the state code.
Using the Regular Expression to Extract Sections of the String:
Below I have attached the addresses dataset for this example as a csv file, which you can import into Stata. If you would like to follow along, download the csv file below and place it in your current working directory. You can find what your current working directory is by typing the command pwd in the command pane.
To start, we import the dataset using the following command:
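A minimal sketch of the import step follows. The file name addresses_demo.csv, the variable name address, and the three rows are assumptions made up for illustration; the sketch writes its own small stand-in file so it can run without the download. With the real file, only the import delimited line is needed, with the file name changed to match.

```stata
* Write a small stand-in CSV (hypothetical name and contents).
tempname fh
file open `fh' using "addresses_demo.csv", write text replace
file write `fh' "address" _n
file write `fh' "1 Smith Street Perth 6000 WA" _n
file write `fh' "22 O'Connell Street Sydney 2000 NSW" _n
file write `fh' "305 Old Coast Road Hobart 7000 TAS" _n
file close `fh'

* Import it: first row holds the variable name, all columns kept as strings.
import delimited "addresses_demo.csv", varnames(1) stringcols(_all) delimiters(",") clear
describe
list
```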
We now apply our previously developed regular expression to extract each of the four elements of our string. Remember each () section is assigned a number, and this is given to the regexs() function to extract the appropriate element. For example, regexs(3) would extract the postcode. We extract each of the elements with the following commands:
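The extraction can be sketched as below, pairing regexs() with regexm() in an if qualifier so that the section is only taken from rows where the expression matched. The variable names address, street_number, postcode and state are assumptions for illustration (street_suburb is the name used later in this article), and the three rows typed in with input are made-up stand-ins for the imported dataset.

```stata
* Made-up sample rows standing in for the imported dataset.
clear
input str60 address
"1 Smith Street Perth 6000 WA"
"22 O'Connell Street Sydney 2000 NSW"
"305 Old Coast Road Hobart 7000 TAS"
end

* One generate per section; the number passed to regexs() picks the () section.
generate street_number = regexs(1) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")
generate street_suburb = regexs(2) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")
generate postcode = regexs(3) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")
generate state = regexs(4) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")

list street_number street_suburb postcode state
```

Rows that do not match the expression are left with empty strings in all four new variables, which makes them easy to find afterwards.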
That looks good, but now I want to extract the suburb. While there are some places in Australia where the suburb contains two words, for this dataset all suburbs are one word only. This makes extracting the suburb relatively easy; however, we will need to modify our regular expression because I now want to extract the last word before the postcode. The easiest way to do this is to remove the first part of the regular expression. So far in this example we have used a regular expression that gives the whole string structure. To extract the suburb I can use a regular expression that doesn't indicate where the beginning of the string is, allowing me to search for a subsection of the address string. To do this I remove everything before the postcode, and then add a word search that does not include the space. So before the postcode I add – ([A-Za-z'-]+). Note that the + sits inside the parentheses so that the whole word is captured rather than a single character. The difference from the earlier character set is subtle, but because a space can no longer be matched, only the single word immediately before the postcode is captured, which is the suburb. My regular expression now looks like this: ([A-Za-z'-]+)_([0-9][0-9][0-9][0-9])_([A-Z][A-Z][A-Z]?$)
There is one important point to be aware of. Because I am changing the sections, the section numbering will change because it is sequenced from left to right. So with my new regular expression, to extract the postcode I would now need regexs(2) and to extract the state code I would now need regexs(3). To extract the suburb I use regexs(1).
I extract the suburb using the following command:
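A sketch of the suburb extraction, using the shortened expression with no start anchor and no space inside the first character set. The variable names address and suburb and the sample rows are assumptions made up for illustration.

```stata
* Made-up sample rows standing in for the imported dataset.
clear
input str60 address
"1 Smith Street Perth 6000 WA"
"22 O'Connell Street Sydney 2000 NSW"
"305 Old Coast Road Hobart 7000 TAS"
end

* Section 1 is now the single word immediately before the postcode.
generate suburb = regexs(1) if regexm(address, "([A-Za-z'-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")

list address suburb
```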
Our suburb has been successfully extracted. It is also possible to extract the street name with a regular expression; however, it is easier to use the Stata command split for this purpose. You can split the street_suburb variable into several variables, each containing an individual word. Since we know that the last word in street_suburb is the suburb, the rest of the words will be part of the street name.
To build the street name, you add each word in turn as long as the variable for the word after it is not missing. If the next word variable is missing, then the word you are about to add is the last word, which is the suburb, so it is left out. So as long as the next word is not missing, the word you are adding is part of the street name.
For example, I add the word in variable name1 and variable name2 together only if the name3 variable contains a word. If the name3 variable does not contain a word and is instead missing, then the word in name2 must be the suburb.
To extract the street name I use the following commands in Stata:
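One way to write this is sketched below. It uses a loop over the word variables rather than one replace line per word, which is a choice made here for brevity and not necessarily how the original commands were written. The variable name street_name and the sample rows are assumptions for illustration; street_suburb and the name stub follow the description above.

```stata
* Made-up sample rows; in practice street_suburb comes from the extraction step.
clear
input str40 street_suburb
"Smith Street Perth"
"O'Connell Street Sydney"
"Old Coast Road Hobart"
end

* Split on spaces into name1, name2, ...; r(nvars) is the number created.
split street_suburb, generate(name)
local last = r(nvars) - 1

* Start with the first word, then add word i only if word i+1 exists,
* which means word i is not the final word (the suburb).
generate street_name = name1
forvalues i = 2/`last' {
    local next = `i' + 1
    replace street_name = street_name + " " + name`i' if name`next' != ""
}

list street_suburb street_name
```

This sketch assumes every row has at least two words (a street name and a suburb); a row with a single word would have that word treated as the street name.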
I now have every part of the street address in separate variables.
The address dataset I used here is a subset of a much larger set of Australian addresses with a much more varied structure. In the original dataset some issues I came across included that suburbs in Australia can contain either one or two words, which makes extracting the suburb much more difficult than it was in this example. There are also issues with street names, as not all street names have a suffix like "Road" or "Avenue". For example, there is a street name in Australia called "The Causeway" which can be difficult to separate from its suburb using regular expressions.
Building regular expressions for your own data will involve some trial and error, as you usually do not have a string variable as clearly structured as the one we used here. When deconstructing a string variable with regular expressions I find it useful to reconstruct the string variable and compare the reconstructed string to the original. This way I can search for any errors in my regular expressions, as the reconstructed string variable will not match the original variable in places where the regular expressions failed to properly extract the information.
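The reconstruction check described above might look like the following sketch. The variable names address, street_number, postcode, state and rebuilt are assumptions, and the rows are made up; the last row deliberately breaks the expected structure so that the check has something to find.

```stata
* Made-up sample rows; the fourth does not start with a street number.
clear
input str60 address
"1 Smith Street Perth 6000 WA"
"22 O'Connell Street Sydney 2000 NSW"
"305 Old Coast Road Hobart 7000 TAS"
"Unit 4 12 King Street Darwin 0800 NT"
end

* Extract the four sections as before.
generate street_number = regexs(1) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")
generate street_suburb = regexs(2) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")
generate postcode = regexs(3) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")
generate state = regexs(4) if regexm(address, "(^[0-9]+) ([A-Za-z '-]+) ([0-9][0-9][0-9][0-9]) ([A-Z][A-Z][A-Z]?$)")

* Rebuild the address from its parts and flag rows that differ from the original.
generate rebuilt = street_number + " " + street_suburb + " " + postcode + " " + state
count if rebuilt != address
list address rebuilt if rebuilt != address
```

Only the row that the expression failed to match is listed, which points directly at the structure the regular expression does not yet handle.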
Data is not always given to us in an appropriate format. There are many situations where information is collected together as a single long string. Regular expressions are a very powerful tool for extracting otherwise inaccessible information from these strings.