The Tabstat Command - Comprehensive Summary Statistics

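The tabstat command displays summary statistics for numeric variables. It can report any of the following statistics: mean, count (or n), sum, min, max, range, sd, variance, cv (coefficient of variation), semean (standard error of the mean), skewness, kurtosis, the percentiles p1, p5, p10, p25, median (or p50), p75, p90, p95 and p99, iqr (interquartile range), and q (shorthand for p25 p50 p75).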

You can specify these statistics in any combination and in any order as part of the tabstat command, which gives you more comprehensive summary statistics than the summarize command does. It also allows you to tailor a summary statistics table to your needs.

Please note: this command does not work on string variables.

How to Use:

tabstat var1 var2 var3, statistics(stat1 stat2 stat3)

OR, if all variables in your dataset are numeric (i.e. not strings) and you want them all included in the summary table:

tabstat *, statistics(stat1 stat2 stat3)

Worked Example 1:

Using the auto dataset, I am going to generate a small table with the first percentile (p1), the ninety-ninth percentile (p99), and the range. I will apply these statistics to the weight and length variables. In the command pane I type the following:

sysuse auto, clear
tabstat weight length, statistics(p1 p99 range)

Which gives the following output:

In this case the first and ninety-ninth percentiles give the same values as I would get from specifying min and max, because the auto dataset has only 74 observations. In a larger dataset there may be a difference between p1 and min, and between p99 and max. The range is the difference between max and min (which here is also the difference between p99 and p1).
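To check this for yourself, one option is to request min and max alongside p1 and p99 in the same table. This is only a sketch of one way to make the comparison; the choice and order of statistics (including n for the number of observations) is my own and not part of the original example.

```stata
sysuse auto, clear
* Show min next to p1 and p99 next to max so the rows can be compared directly.
* n reports the number of non-missing observations for each variable.
tabstat weight length, statistics(n min p1 p99 max range)
```

If the min and p1 rows match, and likewise the p99 and max rows, the percentiles are simply landing on the extreme observations.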

Worked Example 2:

The approach in the first worked example is fine if you just want to look at some statistics for a few variables. However, you can also use the asterisk (*) to specify all variables, provided there are no string variables in your dataset. If you use the asterisk while there are string variables in your dataset, you will get an error.
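A quick way to see that error without halting a do-file is to wrap the command in capture noisily, as in the sketch below. It assumes the auto dataset, where make is a string variable; the exact wording of the error message is not reproduced here.

```stata
sysuse auto, clear
* capture noisily still prints the error message but lets execution continue
capture noisily tabstat *, statistics(mean)
* A nonzero return code confirms that tabstat refused the string variable
display _rc
```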

If you have string variables in your dataset, there are three ways you can deal with this. The first option is to use the encode command to convert your string variables to numeric variables, and then drop the original string variables from the dataset so that you can use the asterisk. The second option is to use the preserve command. In this case you “preserve” your data, drop your string variables from your dataset, run the tabstat command with the asterisk, and then “restore” your data with the restore command to get your string variables back. The third option is to use the ds command with the not(type string) option, and then replace the asterisk with the macro `r(varlist)'.
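For comparison, the first option (encode) could be sketched as follows on the auto dataset. The new variable name make_num is a hypothetical name chosen for illustration. Bear in mind that encode assigns numeric codes in alphabetical order of the string values, so the summary statistics reported for make_num are not meaningful in themselves.

```stata
sysuse auto, clear
* Create a labelled numeric copy of the string variable (make_num is an illustrative name)
encode make, generate(make_num)
* Remove the original string variable so that * matches only numeric variables
drop make
tabstat *, statistics(mean sd min max)
```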

In this example I am going to use the preserve method. If you would like to know more about encoding variables, check out this tech tip: Encode and Decode Commands.

In the command pane I type the following:

sysuse auto, clear
preserve
drop make
tabstat *, statistics(mean sd min max semean q)
restore

Once I type restore, the dataset is restored to how it looked when I initially preserved it (i.e. before I dropped the string variable “make”), and the results of the tabstat command are shown in the results pane.

In this example I knew how many string variables there were in my dataset (just one) and the variable name, making it easy to drop my string variable and then run the tabstat summary on all remaining variables. If you have quite a few string variables, or you don’t know how many string variables you have, then you can use the ds command instead. For example:

ds, not(type string)
tabstat `r(varlist)', statistics(...)
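Filled in for the auto dataset, the ds approach might look like the sketch below. The local macro name numvars and the particular statistics are my own illustrative choices. Copying r(varlist) into a local macro straight away is a precaution, since r() results are overwritten by the next command that stores results there.

```stata
sysuse auto, clear
* List every non-string variable; ds leaves their names in r(varlist)
ds, not(type string)
* Save the list before another command overwrites r()
local numvars `r(varlist)'
* Summarise all numeric variables without dropping anything from the dataset
tabstat `numvars', statistics(mean sd min max)
```

Because nothing is dropped, there is no need for preserve and restore with this method. If you run these lines from a do-file, run them together so that the local macro is still defined when tabstat is called.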