When you load a dataset into Stata, it keeps the data in memory. This makes Stata fast and safe for interactive use. However, when you add variables derived from other variables, the dataset grows, and it must still fit into the available memory. When working with a growing dataset, you may want to shrink it to reduce the amount of memory used. Today I would like to introduce three methods to reduce the size of your datasets in Stata.
Method 1: The compress command
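The compress command reduces memory use by storing each variable in the smallest storage type that can hold its values. You can give compress a varlist to restrict it to certain variables. If you do not give a varlist, it acts on all the variables. The compress command works on both string and numeric variables.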
NOTE: Stata will not reduce the precision of your data by compressing the data.
I will use the automobile dataset to demonstrate how you can use this command. Let's pretend that the automobile dataset is extremely large, with several thousand variables.
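A minimal sketch of the compress workflow, assuming the automobile dataset is the auto.dta example file that ships with Stata (loaded here with sysuse). The describe calls are there only to compare storage types before and after.

```stata
* Load the automobile dataset that ships with Stata
sysuse auto, clear

* Look at the storage types before compressing
describe

* Compress every variable (no varlist given)
compress

* Alternatively, restrict compression to selected variables
compress mpg weight

* Look at the storage types again to see what changed
describe
```

Running compress a second time is harmless: variables that are already stored in their smallest lossless type are left as they are.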
NOTE: Memory compression is a concern only when you are dealing with big datasets.
Method 2: The use command
Another way to manage your memory is to load only the necessary variables into Stata. Sometimes a dataset can have thousands of variables and billions of observations, while your analysis needs only a few variables or a range of observations. The use command is an effective way to achieve this. Again, let's pretend that the automobile dataset is extremely large, with several thousand variables, but for our analysis we only want the variables mpg (Mileage) and weight. To load just these two variables, I would list them before the using keyword of the use command.
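A minimal sketch of loading a subset, assuming the automobile dataset is the auto.dta file shipped with Stata and that findfile can locate it on the ado-path. The local macro name autopath is only illustrative, and the range 1/20 is an arbitrary example.

```stata
* Locate auto.dta on the ado-path; findfile returns the full path in r(fn)
findfile auto.dta
local autopath "`r(fn)'"

* Load only mpg and weight instead of the whole dataset
use mpg weight using "`autopath'", clear

* Load the same two variables, but only the first 20 observations
use mpg weight in 1/20 using "`autopath'", clear

* Confirm what is now in memory
describe
```

With your own data, replace the macro with the path to your .dta file. Run the block as a whole (for example from a do-file) so that the local macro is still defined when use is executed.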
Method 3: The describe command
If you are not sure which variables you want to load, you can explore a dataset without loading it into memory. The describe command, combined with using, allows you to do exactly that. I will use the automobile dataset to demonstrate how to use this command. Its output gives you the storage type for all the variables.
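A minimal sketch of describing a file on disk, again assuming the automobile dataset is the auto.dta file shipped with Stata and that findfile can locate it. The macro name autopath is only illustrative.

```stata
* Start with nothing in memory to show that describe using loads no data
clear

* Locate auto.dta on the ado-path; findfile returns the full path in r(fn)
findfile auto.dta
local autopath "`r(fn)'"

* Describe the file on disk: variable names, storage types, formats, labels
describe using "`autopath'"
```

After this runs, memory is still empty. Only the description of the file has been read.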
If your dataset has thousands of variables, you might need to subset the variable list. For example, in the automobile dataset, if I would like to know just the variables that start with "t", I would give describe the pattern t* as its varlist.
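A minimal sketch of describing only some variables of a file on disk, under the same assumptions as before (auto.dta shipped with Stata, located with findfile, illustrative macro name autopath).

```stata
* Start with nothing in memory
clear

* Locate auto.dta on the ado-path
findfile auto.dta
local autopath "`r(fn)'"

* Describe only the variables whose names start with "t"
describe t* using "`autopath'"
```

The wildcard follows Stata's usual varlist rules, so the same approach works for other patterns, such as a shared prefix for a group of survey questions.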