The Split Command – Separating Your String Variables
The split command in Stata allows you to separate a string variable into multiple string variables. Stata splits the variable wherever a separator occurs. The default separator is a space character, but you can specify whichever separator you need to split your variables correctly. This command is useful when a single string variable holds several pieces of information, such as a make and a model, that you want to keep in separate variables.
How to Use:
To specify a different separator:
split variable, parse(character)
To specify the name of the new variables created:
split variable, generate(newvariablename)
There are other options for this command that can be useful depending on your goal. Check out the help file with the command help split to learn more about the options available.
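As a minimal sketch of the two options above used together, the following builds a small toy dataset and splits it on a comma. The variable name fullname, the stub name and the values are made up for illustration and do not come from any Stata example dataset.

```stata
* Toy data: the variable fullname and its values are hypothetical
clear
input str20 fullname
"Smith,John"
"Nguyen,Anh"
"Garcia,Maria"
end

* Split on a comma instead of the default space,
* naming the new variables name1 and name2
split fullname, parse(",") generate(name)

list fullname name1 name2
```

Without generate(), split uses the original variable name as the stub, so the new variables here would be called fullname1 and fullname2.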
Worked Example 1:
In this example I am going to use the Stata example dataset auto.dta. In this dataset there is one string variable called make, which contains both the make and model of each car in the dataset. I would like to separate this into two variables – one that contains the make and a second that contains the model name. The advantage of doing this is that I can then convert the make variable into a categorical variable. This allows analysis of the cars to look for differences that could be attributed to the car manufacturer.
In the command pane I type the following commands:
sysuse auto, clear
browse make
This shows the make variable in its original form:
As you can see the variable contains words separated only by spaces, so we only need the default space separator.
I am now going to split and rearrange this variable. In the command pane:
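split make, generate(model)
drop make
* model1 holds the manufacturer; encode turns it into a labelled numeric variable
encode model1, generate(make)
* trim() removes the stray space left when model2 or model3 is empty
generate model = trim(model2 + " " + model3)
drop model1 model2 model3
label variable make "Car Manufacturer"
label variable model "Car Model Name"
order make model
browse make model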
The browse command shows the changes made:
As you can see, now the make and model are in two separate variables. The make variable has been converted to a categorical (numeric) variable with labels attached using the encode command. To learn more about the encode command check out this tech tip.
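The commands in this example rely on split creating exactly three pieces. If you would rather not hard-code that number, one possible approach is to read it from the results that split leaves in r(). The sketch below assumes the auto.dta dataset used above; the stub name part and the local macro n are illustrative choices.

```stata
sysuse auto, clear
split make, generate(part)

* split stores the number of new variables in r(nvars)
* and their names in r(varlist); save the count straight away
local n = r(nvars)
display "new variables: `r(varlist)'"

* Join every piece after the first into a single model name
generate model = ""
forvalues i = 2/`n' {
    replace model = trim(model + " " + part`i')
}

list make part1 model in 1/10
```

Because the loop uses local macros, it is easiest to run from a do-file rather than typing it line by line.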
Worked Example 2:
In this example we are going to use the Stata example dataset pop2000.dta. In this dataset there is an age-group variable saved as a string, called agestr. We are going to split this variable, using age as the stub for the new variable names, which should generate two variables, age1 and age2. These two variables will contain the lower and upper age numbers. In the command pane:
sysuse pop2000.dta, clear
browse agestr
This variable contains age ranges, each separated by the word “to”, except for the first one, which begins with “Under”. Because split accepts multiple parse strings, in this instance I am going to use both “to” and “Under” to split the string. To separate this string variable we use the following commands:
split agestr, parse("to" "Under") generate(age)
browse agestr age1 age2
Here you can see the ages have been successfully separated. If you wanted to convert these into numeric variables you would use the destring command. For example:
destring age1, replace
destring age2, replace
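As an alternative to running destring afterwards, split has its own destring option, which converts the new variables to numeric where possible. The sketch below shows this on a small toy dataset rather than pop2000.dta; the variable name agerange and its values are made up for illustration, and the parse string includes the surrounding spaces so that each piece contains only digits.

```stata
* Toy data: the variable agerange and its values are hypothetical
clear
input str10 agerange
"5 to 9"
"10 to 14"
"15 to 19"
end

* Split on " to " (spaces included) and convert the pieces to numeric in one step
split agerange, parse(" to ") generate(age) destring

* Check that both new variables are numeric
confirm numeric variable age1 age2
list agerange age1 age2
```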
If you are looking to pull out a specific text sequence that is not easily separable by either a single separator or several alternative separators, you would be better off using Stata’s powerful regular expression functions. Check out this tech tip to learn more about regular expressions in Stata.
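To give a flavour of the regular expression approach, here is a minimal sketch using the regexm() and regexs() functions to pull the digits out of the middle of a string. The variable name code, its values and the pattern are made up for illustration.

```stata
* Toy data: the variable code and its values are hypothetical
clear
input str12 code
"ID-1234-AU"
"ID-77-NZ"
end

* regexm() tests for a match; regexs(1) returns the text
* captured by the first set of parentheses in the pattern
generate digits = regexs(1) if regexm(code, "-([0-9]+)-")

list code digits
```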