The Reshape Command - Long and Wide Datasets

The reshape command provides an easy way of moving your dataset between wide and long formats. Put simply, a wide dataset holds all the information for each ID in a single observation, whereas a long dataset spreads that information over multiple observations per ID.

A Long Dataset

A dataset in long format has multiple observations for each value of a “primary ID” variable, a “secondary ID” variable that uniquely identifies each observation within a given primary ID, and a third “information” variable that holds the information of interest. For example, the “primary ID” variable could be a family name, the “secondary ID” variable could numerically record which member of the family each observation represents (e.g. 1=Mum, 2=Dad, 3=eldest child, 4=youngest child), and the “information” variable could record the age of each family member. This example is set out in the table below:
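As a rough sketch of what such a long dataset could look like in Stata, the commands below type one in by hand. The family names, the variable names (family, member, age) and all the ages are invented purely for illustration.

```stata
* Hypothetical family data in long format (all values invented)
clear
input str10 family byte member byte age
"Smith" 1 42
"Smith" 2 44
"Smith" 3 12
"Smith" 4 9
"Jones" 1 38
"Jones" 2 40
"Jones" 3 15
"Jones" 4 11
end

* Each family/member pair should identify exactly one observation
isid family member

* Show the data grouped by family
list, sepby(family)
```

Here family plays the role of the “primary ID”, member the “secondary ID”, and age the “information” variable.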

A Wide Dataset

A wide dataset contains only one observation per value of the “primary ID” variable. It spreads the contents of the “information” variable across multiple variables, one for each “secondary ID” value, so that all the information fits into one observation. In our family name example, rather than 4 observations each with a different age, there are 4 different age variables, each recording a different family member’s age. By convention these are usually named age1 for the first family member, age2 for the second, and so on. This example is set out in the table below:
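A comparable sketch of the wide layout is given below, again typed in by hand. The family names and ages are invented for illustration, and the variable names age1 to age4 simply follow the naming convention described above.

```stata
* Hypothetical family data in wide format (all values invented)
clear
input str10 family byte age1 byte age2 byte age3 byte age4
"Smith" 42 44 12 9
"Jones" 38 40 15 11
end

* In wide format the primary ID alone identifies each observation
isid family

list
```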

From Long to Wide

To use the reshape wide command, you need a dataset in long format, with clearly identifiable “primary ID”, “secondary ID” and “information” variables (as described above). All other variables in your dataset must be constant within each “primary ID”.
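One way to check these requirements before reshaping is sketched below, using the Stata example dataset “bplong” from the worked example later on. It assumes that patient is the “primary ID”, when is the “secondary ID”, and that sex and agegrp are the other variables in the dataset.

```stata
sysuse bplong, clear

* The primary and secondary IDs together should identify each observation;
* isid stops with an error if they do not
isid patient when

* Every other variable should be constant within the primary ID;
* each assert stops with an error if the variable changes within a patient
bysort patient (sex): assert sex[1] == sex[_N]
bysort patient (agegrp): assert agegrp[1] == agegrp[_N]
```

If all three checks run without an error, the dataset should be ready for reshape wide.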

From Wide to Long

To use the reshape long command, you need a dataset in wide format, with clearly identifiable “primary ID” and “information” variables. The “information” variables should all share the same stub name and should all contain the same type of information in the same format. For example, with the stub name “age”, the dataset would contain the variables “age1” and “age2”, each recording age in years (same type of information) in numeric form (same format).

How to Use:

Reshape long to wide:

reshape wide info_variable, i(primary_id) j(secondary_id)

Reshape wide to long:

reshape long info_stub_varname, i(primary_id) j(choose_varname)
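To show how the two templates fit together, here is a minimal round trip on a small hand-typed dataset. The variable names (family, member, age) and all the values are hypothetical.

```stata
* Hypothetical long data: two families with two members each (values invented)
clear
input str10 family byte member byte age
"Smith" 1 42
"Smith" 2 44
"Jones" 1 38
"Jones" 2 40
end

* Long to wide: age becomes age1 and age2, one observation per family
reshape wide age, i(family) j(member)
list

* Wide to long: age1 and age2 are stacked back into age,
* and member is recreated from the numeric suffixes
reshape long age, i(family) j(member)
list, sepby(family)
```

Note that the stub (age) is given without its suffix in both directions, and that j() names an existing variable when going to wide but a new variable when going to long.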

Worked Example 1 (Long to Wide):

In this example I use the Stata example dataset “bplong” to demonstrate reshaping a long dataset to a wide dataset. In the command pane I type the following:

sysuse bplong, clear
browse

reshape wide bp, i(patient) j(when)
browse

As you can see, the data has been reshaped to wide and we now have 2 bp variables, bp1 and bp2. When it reshapes like this, Stata takes the “information” variable name and appends each “secondary ID” value to it to create the new variable names. In this case the values were numeric, with 1 representing “Before” and 2 representing “After”. If the “when” variable were a string variable rather than a numeric variable, you would have to add the “string” option to your reshape command. This would then generate variable names of “bpBefore” and “bpAfter” rather than “bp1” and “bp2”.
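To try the string case on the same dataset, one option is to convert “when” to a string variable first. The sketch below assumes “when” carries a value label with the text “Before” and “After”, and the name when_str is made up for this example.

```stata
sysuse bplong, clear

* Turn the labelled numeric variable into a string variable
decode when, generate(when_str)
drop when

* The secondary ID is now a string, so the string option is required
reshape wide bp, i(patient) j(when_str) string

* List the reshaped blood pressure variables
describe bp*
```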

Worked Example 2 (Wide to Long):

In this example I use the Stata example dataset “bpwide” to demonstrate reshaping a wide dataset to a long dataset. In the command pane I type the following:

sysuse bpwide, clear
browse

reshape long bp_, i(patient) j(when) string
browse

As you can see, the data has been reshaped to long and our 2 bp variables have been combined into one. When it reshapes like this, Stata takes the “information” variable stub name (in this case bp_) and uses the remainder of each variable name as the “secondary ID” values, storing them in a new variable with the name you gave in j(). In this case the suffixes after the bp_ stub are words (before and after), so we have to specify the string option so Stata knows these are not numeric values. If the variable names had been “bp_1” and “bp_2” the string option would not have been needed. Note that the combined variable takes the name of the stub, so here it is called bp_ rather than bp.
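The numeric-suffix case can be tried on the same dataset by renaming the two variables first. This is only a sketch: the new names bp_1 and bp_2, the final name bp, and the label name whenlbl are choices made for this example.

```stata
sysuse bpwide, clear

* Give the information variables numeric suffixes instead of words
rename (bp_before bp_after) (bp_1 bp_2)

* With numeric suffixes the string option is not needed
reshape long bp_, i(patient) j(when)

* Optional tidy-up: drop the trailing underscore and label the values of when
rename bp_ bp
label define whenlbl 1 "Before" 2 "After"
label values when whenlbl

list in 1/4
```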