Regular Expressions in Stata - A Practical Example

Regular expressions can be used to extract information from text data, and they have widespread applications. For example, you can use regular expressions to pull personal information such as addresses, names, email addresses, and phone numbers out of text. The text will usually need to be somewhat structured, for example with the name always being followed by the address.


Another useful application of regular expressions is to extract information from a bank statement. This is something that is useful for everyone, as it offers the ability to examine all your transactions and can give you a granular view of what your spending and saving habits really are. I find this a useful tool for tracking how well my household is following our budget.


Using a fabricated bank statement from a fictional bank, this example will show you how to go about analysing your own bank statements. Some banks will allow you to download the information as a CSV file, which can be imported directly into Stata. I recommend using this feature, provided no information that appears in the PDF is excluded from the CSV. For this example, I first converted the statement from PDF to Excel and organised it. If you are interested in how I converted, imported, and organised the data ready for analysis, please check out this PDF to Stata Tech Tip.


If you would like to follow along, you can use the following dataset:

exampleBankStatement.xlsx (5 KB)


Using Regular Expressions to Extract Transaction Information

I am going to extract the following pieces of information from this statement: whether the transaction was a Visa transaction, whether it was a transfer between accounts (internal or external), the original purchase date, the card number used for the purchase, the receipt number for each purchase, the shop name, and the location of the shop.


To start, let’s look at the data:



By looking at the description we have for each transaction in the details variable, we can see a pattern. For purchases, the description follows this template:


Visa Purchase - Receipt number SHOP NAME LOCATION Date dd/mm/yy Card number


So to begin with, let’s record whether or not items were Visa transactions. All Visa transactions have the word “Visa” in the transaction description (as shown in the template above), so it is simple enough to create a binary variable for this. In the command pane:

generate visa = 1 if strpos(details, "Visa")
replace visa = 0 if missing(visa)

You will note from the above code that I am using the strpos() function rather than regular expressions. Since all I am doing here is checking whether a particular word is present, strpos() is faster and simpler. We can do the same thing to determine whether transactions are transfers between accounts, as all the money transfer descriptions contain the word “Transfer”. In the command pane:

generate transfer = 1 if strpos(details, "Transfer")
replace transfer = 0 if missing(transfer)
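
As an aside, because strpos() returns the position of the word (and 0 if it is absent), the two-step generate/replace pattern above can also be collapsed into a single line per variable. A sketch (visa2 and transfer2 are just alternative names so the variables above are not overwritten):

* the comparison evaluates to 1 when the word is found and 0 otherwise
generate visa2 = strpos(details, "Visa") > 0
generate transfer2 = strpos(details, "Transfer") > 0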

Now I want to extract the purchase date from the transaction details. The purchase date differs from the date supplied on the statement because many purchases remain pending for several days before the transaction is completed. As this is an extraction rather than simply checking for the presence of a word, I will need to use regular expressions. Looking at the template for the descriptions again, I know that the purchase date is shown as “Date dd/mm/yy”. As this format is the same for every purchase, I can use it to extract the purchase date in the following manner:

generate p = regexs(1) if regexm(details, "Date ([0-9][0-9]/[0-9][0-9]/[0-9][0-9])")
generate pdate = date(p, "DM20Y")
drop p
format pdate %tdDD_Mon_CCYY
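
Before moving on, it can be worth spot-checking the extraction. Assuming every Visa description follows the template above, something like the following should show sensible dates and confirm nothing was missed:

* spot-check a few extracted purchase dates against the raw descriptions
list details pdate in 1/5
* every Visa purchase should have been assigned a purchase date
assert !missing(pdate) if visa == 1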

Now that I have my purchase date, I can calculate how many days the transaction took to go from pending to processed. In the command pane:

generate pending = date - pdate
collect: mean pending
collect label levels result _r_b "Mean", modify
collect layout (colname) (result[_r_b _r_se _r_ci]), name(default)

This produces the following table:

From this we can see that on average it took 2 days for the bank to clear money for a purchase.
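
As an aside, if you just want that average without building a table, summarize gives the same number directly (a quick cross-check rather than part of the main workflow):

* quick cross-check of the mean pending time
summarize pending
display "Mean days from purchase to processing: " %4.1f r(mean)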


Returning to our purchase template, I am now going to extract the shop name and location for each purchase. From the template, I know that the word “Receipt” is followed by a number, which is then followed by the shop name, and sometimes by another number and a location name as well. So our regular expressions look like this:

generate shop = regexs(1) if regexm(details, "Receipt [0-9]+ ([A-Z \*]+) [0-9A-Z ]*Date") & visa == 1
generate location = regexs(1) if regexm(details, "Receipt [0-9]+ [A-Z ]+[0-9]+ ([A-Z ]+) Date") & visa == 1

In the first regular expression, there is a backslash followed by an asterisk or “\*”. This is important because in my transaction descriptions there is sometimes an asterisk as part of the description. However, the asterisk in Stata regular expressions has a special function, so in order to get Stata to look for a normal asterisk I need to escape the character with a backslash. There are several other special regular expression characters that you will need to do this with if you are searching for them in text.
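
If you want to see the effect of the escape interactively, you can test it against one of the descriptions. Assuming the backslash escape behaves as described above, this should return 1:

* the escaped asterisk matches the literal "*" in the shop name
display regexm("PAYPAL *STEAM GAMES", "PAYPAL \*")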


In the second regular expression I am a little more explicit than in the first. I state that the word “Receipt” is followed by a number, then some words, then some more numbers, then some more words. It is the second set of words that I am interested in here, as these denote the location if they exist. If they do not exist, nothing will be extracted.
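
Before relying on these variables, it is worth eyeballing what the two expressions actually captured:

* compare the extracted shop and location against the raw descriptions
list details shop location if visa == 1
* how many purchases ended up with no location?
count if missing(location) & visa == 1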


Finally, I can extract the receipt number and the last 4 digits of the purchase card using the following regular expressions:

generate card = regexs(1) if regexm(details, "Card ([0-9]+)")
generate receipt = regexs(1) if regexm(details, "Receipt ([0-9]+)")
destring receipt, replace
destring card, replace
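
One small caution with destring (a general point rather than an issue specific to this statement): it converts the values to numbers, so a card or receipt number with a leading zero would lose that zero; if that matters, keep the variables as strings. A quick check that the extraction and conversion behaved, assuming every purchase description contains both numbers:

* every Visa purchase should now have a numeric receipt and card number
assert !missing(receipt) & !missing(card) if visa == 1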

Now, I want to look at all the different shops. If you are looking through your own transactions, how you go about this will depend somewhat on how many different places you shop at. In the case of my fake statement, there aren’t that many shops, so I am simply going to encode the shop variable. Once encoded, I can have a quick look at the list of shops with the table command. After that, I will list the label attached to the shop variable. This prints the shops and their corresponding numbers, which I can then use to assign categories. In the command pane:

encode shop, generate(s) label(shoplbl)
drop shop
rename s shop
table (shop), statistic(frequency)

This produces the following table:

From this table we can see the most frequent purchases were made at the online game store “Steam” and the supermarket “Woolworths”. Now let’s have a look at the assigned value label numbers so we can create our categories:

label list shoplbl

This prints the following list:

1 BP PARK ROAD
2 BUNNINGS
3 CALTEX CITY HIGHWAY
4 COLES EXPRESS
5 DELIVEROO
6 PAYPAL *STEAM GAMES
7 PAYPAL *UBEREATS AU
8 WOOLWORTHS


From this list, we assign purchase categories:

generate cat = "Transport" if shop == 1 | shop == 3 | shop == 4
replace cat = "Home Improvement" if shop == 2
replace cat = "Eating Out" if shop == 5 | shop == 7
replace cat = "Groceries" if shop == 8
replace cat = "Entertainment" if shop == 6

At the moment, we have only looked at purchases. However, there are internal transfer payments, both credits and debits, that have so far not been categorised. As all the internal transfers have descriptions that were given at the time of transfer, we can use the strpos() function again to apply categories according to what is contained in the description. In the command pane:

replace cat = "Salary" if strpos(details, "Salary")
replace cat = "Rent" if strpos(details, "Rent")
replace cat = "Savings" if strpos(details, "Savings")
replace cat = "Utilities" if strpos(details, "Origin")
count if missing(cat)

The last line is used to check whether there are any transactions that have not yet been assigned a category. As the number returned here is 0, we know that all transactions have been categorised.
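
If the count had come back greater than zero, a natural next step would be to look at the uncategorised rows and extend the rules. A minimal sketch:

* inspect any transactions that did not match a category rule
list date details amount if missing(cat)

Now that everything is categorised, we can encode our category variable to further analyse our transactions. In the command pane: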

encode cat, generate(category) label(catlbl)
drop cat
table (category), nototals statistic(total amount) statistic(mean amount)

This produces the following table:


This table shows the total purchase amount as well as the average cost of each purchase; however, we don’t know what time period we are looking at. Is this over a week or a month? What is the date range? We can easily display the date range as follows:

summarize date
display %tdDD_Mon = r(min)
display %tdDD_Mon = r(max)

This prints the minimum and maximum dates to the results pane in Stata. The starting date for this statement is 2 January and the ending date is 31 March, so this statement covers three months’ worth of transactions. I would like to add a column to the above table to show the average amount per month for each category. I can use the month() date function to create an extra variable separating observations by month, which can then be included in the table command. In the command pane:

generate month = month(date)
table (category) (month), statistic(total amount) statistic(mean amount) totals(category)
collect layout (category) (month#result[total] month[.m]#result[mean])
collect style header result[total], level(hide)
collect style column, nodelimiter dups(center)

This produces the following table:


This table lists the total amount for each category for each month in the statement. It also has the information from our previous table which shows the 3-month totals and the average amount per transaction.
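
One small caveat about the month variable used above: month() returns only the month number (1 to 12), so if a statement spanned more than one calendar year, the same month in different years would be pooled together. A more robust sketch uses mofd() to create a true monthly date (mdate is an assumed variable name), which could then replace month in the table command:

* monthly date that distinguishes, say, January 2023 from January 2024
generate mdate = mofd(date)
format mdate %tmMon_CCYY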


Finally, I would like to create a table that summarises purchases made at each of the physical locations. To do this I first need to convert the “location” variable to a numeric categorical variable. In the command pane:

encode location, generate(loc) label(loclbl)
drop location
rename loc location
table (location) (month), statistic(total amount) statistic(mean amount) totals(location)
collect layout (location) (month#result[total] month[.m]#result[mean])
collect style header result[total], level(hide)
collect style column, nodelimiter dups(center)

This produces the following table:


 

As demonstrated in this analysis, regular expressions in Stata can be used to analyse all sorts of text data. They can also be useful for extracting or converting numeric data that is contained within text. The ability to extract meaning from large quantities of text data is important, as most data we encounter needs to be extracted, structured, and condensed before it can be analysed. Stata is the perfect tool for these complex analyses.
