Getting PDF Data into Stata: A Practical Example

Data and information are not always stored in easily accessible forms like spreadsheet or text files. One form data can be presented in is PDF files. From bank statements to company financial reports to academic studies, this data is not always made available in other formats for analysis. This information is still accessible, it just needs to be organised first.

Using a fabricated bank statement from a fictional bank, this example will show you how to import data from a PDF. I then go through the data management required to get this bank statement into a format that can be analysed. I do include several data management steps that are unnecessary for this fictional bank statement, but that I have found necessary when analysing my own statements. Each PDF will need to be managed differently, but this example should help you to understand how to go about arranging the data.

Step 1: Conversion

The first step is to convert your PDF into a different format. The easiest way to do this by highlighting the data in the PDF, copying it, and pasting it into a spreadsheet. However, this can sometimes destroy the structure of the data, making it impossible to organise and therefore preventing meaningful analysis. I prefer to convert the PDF into a spreadsheet. If you are looking at a large PDF file I suggest extracting the individual pages with the data on it you want to look at before you convert. To follow along, you can download my fictional bank statement below:

Fictional Bank Statement

To convert this PDF into an excel file I used the free online tool https://www.pdftoexcel.com/. The Adobe PDF to Excel converter is also a good tool, however it requires an annual fee. Please note the conversion can take a few minutes to complete.

Once you have your statement as an excel file, you can load it into Stata. I have included the converted file for my fictional bank statement below:

Fictional Bank Statement Excel

I am now going to import my converted excel file and perform some data management to get the data into a format that I can analyse. I recommend using the menus to import your excel file so you are able to see an overview of what format the converted file has been saved in.

In the command pane:

import excel "exampleBankStatement.xlsx", cellrange(A11) firstrow case(lower) allstring clear
rename transactiondetails details
browse

The imported data initially looks like this:

Looking through the dataset there are a few things to note. Firstly, the headers for each page have been brought across as observations and repeated multiple times within the dataset. The page numbers have also been brought across as their own observations. There is an observation at the end which contains the words “end statement”. And finally, the last transaction entry for each page has been brought across into the wrong column. All transactions have two to three observations, with extra information in the details variable split across multiple observations. For the last transaction per page, this information is in the date variable instead of details.

This dataset will need to be modified in order to get it into a form that can be analysed. To begin with I run the following in the command pane:

replace details = date if !missing(date) & missing(details)
replace date = "" if date == details
drop strpos(date, "Date")
drop if regexm(details, "Page [0-9]+ of [0-9]+")
drop if missing(date) & missing(details)
drop in 213

The first two commands move the “details” into the correct variable where it has been placed into the date variable erroneously. The third and fourth commands remove the observations containing the extra header lines and page numbers. For information on the regexm() function, please check out our Tech Tip on Regular Expressions. The fifth command drops any wholly missing observations, and the last command drops the line containing “end statement”.

For some bank statements, there may be multiple lines at the end of the excel file that contain summary information from that statement. To drop all these together, find the observation number for the first line you want to drop and then add “/l” (lower-case L) to it. For example, if your first line is observation 213 and your last line is observation 220 you can use the command drop in 213/l to drop all lines 213 to 220. Your statement may also contain extra lines which also need to be removed. You can use either the strpos() or regexm() methods to drop these observations depending on what information they contain. For example, you might have an extra line with “Transactions” in it that appears above the header on each page. To remove this, you could use the command:

drop if strpos(variable_name, "Transactions")

Now that I have removed all the erroneous observations, I need to combine the details information into a single observation. To do this, I use the following commands:

replace details = details + " " + details[_n+1] if !missing(date) & missing(date[_n+1])
drop if missing(date) & !missing(date[_n-1])
replace details = details + " " + details[_n+1] if !missing(date) & missing(date[_n+1])
drop if missing(date) & !missing(date[_n-1])
count if missing(date)

To explain what is happening here, the first command is adding the information in details from the second observation (n+1) to the information in details from the first observation (n). The if qualifiers are stating that this should only be done in cases where the nth observation has information in the date variable, and the nth+1 observation does not have anything in date (is missing). The second command is then dropping the nth+1 observations because the information contained in those observations has been appended to the nth observation. This command does this by dropping all observations where there is no date, but where there is a date in the previous observation (n-1).

These two commands are repeated until all the data has been collated into a single observation per transaction. I test whether all data has been collated with the final count command, which should display 0 to the results pane. If you get a number greater than 0, run the two commands again and check again until you get 0.

The dataset now looks like this:

Now that I have collected all transaction information into singular observations, there are only a few more minor changes before the dataset will be ready to analyse. To start I am going to convert the date variable from a string to a numeric date variable. In the command pane:

rename date d
generate date = date(d, "DMY")
drop d
order date
format date %tdDD_Mon_CCYY

In some transaction statements the withdrawal, deposit, and balance variables can become merged in places. This is not the case for my fictional statement, however I will demonstrate how to go about fixing this problem if you encounter it. In the command pane:

count if !missing(withdrawal) & !missing(deposit)

This command is a simple count to see if there are any observations with balance in the wrong place. Since each transaction is going to be either a withdrawal or a deposit, if both these variables contain values then one of them is going to be the balance. As most statements list withdrawals and then deposits, we can assume that the number in the withdrawal variable is correct, and the number in the deposit variable is therefore the corresponding balance. If this count comes up as greater than zero, then your numbers are merged and you can fix this with the following commands:

replace balance = deposit if !missing(deposit) & !missing(withdrawal) & missing(balance)
replace deposit = "" if !missing(deposit) & !mising(withdrawal)

The first command copies the balance across from the deposit variable to the balance variable. The second command removes the duplicate from the deposit variable. I am now going to combine the withdrawal and deposit variables in a single amount variable. Please make sure to check your withdrawal variable before doing this, as some bank statements do not show withdrawals with a negative (-) symbol. If this is the case you need to convert them to negative before combining. If the withdrawal variable is a text variable you can use this command:

replace withdrawal = "-" + withdrawal if !missing(withdrawal)

Alternatively, if your withdrawal variable is numeric you can use this command:

replace withdrawal = -withdrawal

Once your withdrawal variable contains negative values and your deposit variable contains positive values, you can then combine them into a single variable. If your withdrawal and deposit variables are still text, this is easily accomplished with:

generate amount = withdrawal + deposit
drop withdrawal deposit

Alternatively, if your variables are both numeric:

generate amount = withdrawal
replace amount = deposit if missing(amount)
drop withdrawal deposit

You may also come across some observations where the withdrawal/deposit and the balance are in the same variable. This is usually due to merged cells in the excel file. The easiest way to fix this is to find and de-merge the cells in the excel file and then re-import the data. Alternatively, you can use the split command to split the offending variable by whitespace. This should create several variables each containing the withdrawal/deposit amount and balance separately. You then need to place these into the appropriate columns. For example, if the deposit and balance are both contained in the deposit variable you could do the following:

split deposit
replace balance = deposit2 if missing(balance) & !missing(deposit2)
replace deposit = deposit1 if !missing(deposit) & !missing(deposit2)
drop deposit1 deposit2

Although again not necessary for this fictional statement, I will demonstrate how to remove commas and dollar-signs from the amount and balance variables as these can sometimes be imported when the transaction amounts are read in as strings. In the command pane:

replace balance = subinstr(balance, ",", "", .)
replace balance = subinstr(balance, "$", "", .)
replace amount = subinstr(amount, ",", "", .)
replace amount = subinstr(amount, "$", "", .)

This simply uses the subinstr() function to replace any dollar-signs or commas with nothing. Finally, I am going to destring the balance and amount variables (as they are still strings in my dataset) and save the dataset for analysis. In the command pane:

destring balance, replace
destring amount, replace
save bankData, replace
browse

The final dataset now looks like this:

Depending on the PDF converter you use and the structure of the PDF itself, the process to get the data into a format that can be analysed may be more straightforward than what is shown here, or it could also require extra steps that were not shown here. Stata has many data management tools that will allow you to access data in many different formats, even those that are more difficult to access.