The Tabstat Command - Comprehensive Summary Statistics

The tabstat command is used to display summary statistics for numeric variables. This command can be used to display the following statistics:

  • mean

  • count/n (number of observations or N)

  • sum (every observation in a variable added together)

  • max

  • min

  • range (minimum subtracted from maximum)

  • sd (standard deviation)

  • variance

  • cv (coefficient of variation – sd/mean)

  • semean (standard error of the mean – sd/sqrt(n))

  • skewness

  • kurtosis

  • p1 (first percentile)

  • p5 (fifth percentile)

  • p10

  • p25

  • p50/median

  • p75

  • p90

  • p95

  • p99

  • iqr (interquartile range – p25 subtracted from p75)

  • q (quartiles, equivalent to specifying p25, p50 and p75)

You can specify these statistics in any combination and in any order as part of the tabstat command, which allows you to get more comprehensive summary statistics than you can from the summarize command. It also allows you to tailor a summary statistics table to your needs.

Please note: this command does not work on string variables.

How to Use:
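As a rough sketch of the general form: list the variables after tabstat, then name the statistics you want inside the statistics() option. The variables and statistics in the concrete line below are chosen only for illustration.

```stata
* General form (varlist and the statistic names are placeholders):
*   tabstat varlist, statistics(statname ...)

* A concrete instance using Stata's built-in auto dataset
sysuse auto, clear
tabstat price mpg, statistics(mean sd n)
```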

OR, if all variables in your dataset are numeric (i.e. not strings) and you want them all included in the summary table:
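A minimal sketch of the asterisk form, using a small made-up dataset (the variables x and y are hypothetical) so that every variable is guaranteed to be numeric:

```stata
* Build a tiny all-numeric dataset purely for illustration
clear
set obs 5
generate x = _n
generate y = _n^2

* The asterisk stands for every variable in the dataset
tabstat *, statistics(mean min max)
```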

Worked Example 1:

Using the auto dataset, I am going to generate a small table with the first percentile (p1), the ninety-ninth percentile (p99), and the range. I will apply these statistics to the weight and length variables. In the command pane I type the following:
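The command described above would look something like the following. This is a reconstruction from the description, assuming the auto dataset is loaded with sysuse:

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* First percentile, ninety-ninth percentile and range for weight and length
tabstat weight length, statistics(p1 p99 range)
```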

Which gives the following output:

In this case the first and ninety-ninth percentiles give the same values as I would get from specifying min and max, because the auto dataset has only 74 observations. In a larger dataset p1 may differ from min, and p99 from max. The range is the difference between max and min (and, in this case, between p99 and p1).

Worked Example 2:

The first worked example shown is great if you just want to look at some statistics for a few variables. However, you are able to use the asterisk (*) to specify all variables provided there are no string variables in your dataset. If you try to use the asterisk while there are string variables in your dataset you will get an error.

If you have string variables in your dataset, there are two ways you can deal with this in order to use the asterisk (*) method. The first option is to use the encode command to convert your string variables to numeric variables, making sure to drop the original string variables from the dataset once you have finished encoding them. The second option is to use the preserve command. In this case you “preserve” your data, drop your string variables from your dataset, run the tabstat command with the asterisk, and then “restore” your data with the restore command to get your string variables back.
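For reference, the encode option might be sketched as follows on the auto dataset. The new variable name make_code and the statistics requested are assumptions made for illustration. Keep in mind that summary statistics of an encoded variable describe its numeric codes, not anything meaningful about the original text.

```stata
sysuse auto, clear

* Convert the string variable make into a labelled numeric variable
* (make_code is a hypothetical name chosen for this sketch)
encode make, generate(make_code)

* Remove the original string variable so the asterisk no longer hits a string
drop make

tabstat *, statistics(mean min max)
```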

In this example I am going to use the preserve method. If you would like to know more about encoding variables, check out this tech tip: Encode and Decode Commands.

In the command pane I type the following:
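The preserve method could be sketched as below. The statistics listed are placeholders chosen for illustration, since any combination from the list at the top of this post will work.

```stata
sysuse auto, clear

* Take a snapshot of the data before changing anything
preserve

* make is the only string variable in the auto dataset
drop make

* Summarise every remaining (numeric) variable
tabstat *, statistics(mean sd min max)

* Bring back the snapshot, including make
restore
```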

Once I type restore, the dataset is restored to how it looked when I initially preserved it (i.e. before I dropped the string variable “make”), and the results of the tabstat command are shown in the results pane.

In this example I knew how many string variables there were in my dataset (just one) and the variable name, making it easy to drop my string variable and subsequently perform the tabstat summary on all remaining variables. If you have quite a few string variables, or you don’t know how many string variables you have, you can easily find and remove them all using the following two lines of command line code:
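One common way to do this, offered here as a sketch rather than a definitive recipe, is to have ds list the string variables and then drop whatever it returns in r(varlist). The preserve, tabstat and restore lines are included only to show the full workflow, and the statistics are placeholders.

```stata
sysuse auto, clear
preserve

* List every string variable; the names are stored in r(varlist)
ds, has(type string)

* Drop the variables that ds found
* (this assumes at least one string variable exists, otherwise drop has nothing to act on)
drop `r(varlist)'

tabstat *, statistics(mean sd min max)
restore
```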

Please make sure to preserve your dataset before removing the string variables, so you can easily restore your string variables once you have used tabstat.
