- Laura Whiting

# An Overview of Linear Regression Post-Estimation Commands

Updated: Apr 2

Stata contains a wide range of post-estimation commands. These are commands that you run after an estimation command, such as a regression. You can use post-estimation commands to test underlying assumptions, make predictions, analyse residuals, look for influences that may be skewing your model, and test the robustness of your model. Below I list some post-estimation commands commonly used with linear regression models and why you would use them.

This is not an exhaustive list; however, it includes all commands listed in the Stata help page as being of 'special interest' to linear regression. While there are many post-estimation commands available, these are the most useful when testing linear regression models. You can read more about linear regression post-estimation by typing "help regress postestimation" in the command line in Stata.

__dfbeta__

This command is used to measure the influence of observations on a single coefficient in the model. Each independent variable has its own coefficient, which is a measure of the effect the independent variable is having on the dependent variable.

The dfbeta compares the coefficient value when an observation is included in the regression model with the coefficient value when the same observation is excluded, scaling the difference by the coefficient's estimated standard error. It is used to help identify individual observations that are having an unusually large influence on the model.
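As a minimal sketch, the example below uses Stata's bundled auto dataset and an illustrative model of price on mpg and weight; neither comes from this article. It assumes the default variable names that recent Stata versions give the new variables (_dfbeta_1, _dfbeta_2, in regressor order) and uses the common 2/sqrt(n) rule of thumb as a flagging threshold, which is a convention rather than a hard cutoff.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight

* Create one DFBETA variable per regressor
* (assumed default names: _dfbeta_1 for mpg, _dfbeta_2 for weight)
dfbeta

* List observations whose DFBETA for mpg exceeds the 2/sqrt(n) rule of thumb
list make price mpg _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(e(N))
```

Observations that appear in this list are candidates for a closer look, not automatic exclusions.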

__estat hettest__

This is the Breusch-Pagan/Cook-Weisberg test for heteroskedasticity. It tests for equal variance of the residuals, which is an underlying assumption of regression models. Regression models with equal variance are considered to be homoskedastic, so we test for heteroskedasticity to determine if there are unequal variances that are skewing our model. This particular test only looks for linear cone-shaped heteroskedasticity. Running this command will give you a p-value, and a value under 0.05 (or your chosen cutoff) indicates there is heteroskedasticity in your model.

**Note:** This test relies heavily on the assumption that your residuals are normally distributed. This test will be biased if run on a regression where the residuals are **clearly** not normally distributed. Always check if your residuals are normally distributed before running this test.
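The sketch below shows one way the check and the test might be run together. It again assumes the bundled auto dataset and an illustrative model, and the variable name resid is made up for the example. The Shapiro-Wilk test (swilk) is only one of several ways to check normality, and the iid option is a variant of the test that drops the normality assumption.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight

* Check the residuals for normality first (resid is a hypothetical name)
predict double resid, residuals
swilk resid

* Default test, which assumes normally distributed residuals
estat hettest

* Variant that does not rely on the normality assumption
estat hettest, iid
```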

__estat imtest__

This is an information matrix test that looks for heteroskedasticity, skewness and kurtosis. Heteroskedasticity is when the variances are unequal. Skewness relates to the distribution curve, where your residuals are skewed if the curve is asymmetrical (the left side does not match the right side of the graph). Kurtosis relates to the distribution curve tails, where your residuals have kurtosis if the tails are excessively long or short compared to a normal distribution.

Linear regression works on the assumption that your variances are equal, and your residuals normally distributed with no or little skewness or kurtosis. This means the presence of any significant heteroskedasticity, skewness or kurtosis will affect the robustness and accuracy of your model. The heteroskedasticity part of this test is more general than the test performed by *estat hettest*, as it covers multiple forms of heteroskedasticity.

**Note:** Please be aware that skewness and kurtosis are only important for your residuals. An OLS linear regression assumes a normal distribution of the residuals **only**. Your dependent and independent variables can follow any distribution, however skewed, flat or peaked, and this will not affect the validity of your model.
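A short sketch of running the information matrix test, assuming the bundled auto dataset and an illustrative model that are not part of this article. The white option additionally reports White's original heteroskedasticity test.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight

* Information matrix test: heteroskedasticity, skewness and kurtosis components
estat imtest

* Also report White's original heteroskedasticity test
estat imtest, white
```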

__estat ovtest__

This is the Ramsey (1969) regression specification-error test (RESET), which tests for misspecification of linear regression models. There are 3 reasons for a misspecified model. Either an important variable has been left out; an unnecessary or irrelevant variable has been included; or there is functional form misspecification (the right independent variables have been used but the model still cannot account for the relationship between the dependent variable and the independent variables). For a linear regression, functional form misspecification indicates you are applying a linear model to non-linear relationship(s).

This RESET test can be used to look for two types of misspecification. It can look for omitted variables, or functional form misspecification, depending on the options you choose when you run the test.
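The two variants can be sketched as follows, again assuming the bundled auto dataset and an illustrative model. By default the test uses powers of the fitted values; with the rhs option it uses powers of the independent variables instead.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight

* Default RESET: powers of the fitted values
estat ovtest

* Alternative: powers of the independent variables
estat ovtest, rhs
```

In both cases a small p-value is evidence against the model as specified.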

__estat szroeter__

This is a rank test for heteroskedasticity, which is an alternative to the score test used in *estat hettest*. This heteroskedasticity test looks at individual variables, testing one at a time. You can either specify the variables you want to test, or use the rhs option to test all the independent variables.

This test is more general than *estat hettest* but more specific than *estat imtest*, due to this test identifying whether variance is increasing monotonically. Monotonic simply means one-way. If you have variance that is increasing monotonically, it means it is never decreasing. This test can identify the cone-shaped heteroskedasticity that is shown by *estat hettest* (as this is monotonic), as well as all other forms of monotonic heteroskedasticity. However, it will not detect heteroskedasticity that is not monotonic, e.g. hourglass-shaped.

Please note if you are running this test on multiple variables (e.g. with the rhs option) it is recommended you employ one of the multiple testing adjustments available with the mtest option. There are three adjustment options, either 'bonferroni', 'holm' or 'sidak'.
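A minimal sketch of both usages, assuming the bundled auto dataset and an illustrative model. The choice of holm here is arbitrary; bonferroni or sidak could be substituted.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight

* Test a single named variable
estat szroeter mpg

* Test every independent variable, with Holm-adjusted p-values
estat szroeter, rhs mtest(holm)
```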

__estat vif__

This calculates the variance inflation factors (VIFs) for the independent variables in your model. The VIF for an independent variable is the ratio of the variance of its coefficient estimate in the model with multiple independent variables (MV) to the variance of that estimate in a model containing only that variable (OV), that is, MV/OV. It is used to test for multicollinearity, which is where two or more independent variables correlate with each other and can be used to reliably predict each other.

Multicollinearity is a problem in regression models because the model is looking at the relationship between one independent variable and the dependent variable, while keeping all other independent variables constant. If you have multicollinearity this means changes in one independent variable will likely cause changes in another, thereby violating the premise that all other independent variables are kept constant. This command allows you to identify if multicollinearity is present in your model.
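The sketch below runs the command and then reproduces one VIF by hand, assuming the bundled auto dataset and an illustrative three-variable model. The hand calculation uses the standard formula VIF = 1/(1 - R²), where R² comes from regressing the variable of interest on the other independent variables. A VIF above about 10 is a frequently quoted warning sign, though that threshold is a rule of thumb.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight foreign

* Report the VIF for each independent variable
estat vif

* Reproduce the VIF for weight: regress it on the other independent variables
regress weight mpg foreign
display "VIF for weight = " 1/(1 - e(r2))
```

The displayed value should match the entry for weight in the estat vif table.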

**estat esize**

This calculates effect sizes for a linear regression or an ANOVA. An effect size measures either the size of an association between variables, or the difference between group means. A bigger effect size means a stronger association or larger difference, and a smaller effect size means a weaker association or smaller difference. This test reports eta-squared estimates by default, which are equivalent to R-squared estimates.

This test can be important in helping you to identify whether your model gives meaningful results. A statistically significant linear regression model tells you there is an association between variables, but it does not tell you how strong that association is. If you have a very small effect size for your regression model, this tells you that although there is an association or correlation between variables, that correlation is very weak. A weak association usually means your model is not particularly useful even if it is significant.

It is also important to note that effect size is independent of sample size, something that is not true of significance tests. It is for this reason that some believe effect sizes to be a more useful measure of the validity of a model.
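A brief sketch, assuming the bundled auto dataset and an illustrative model. The omega option requests omega-squared estimates in place of the default eta-squared.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight

* Default: eta-squared effect sizes with confidence intervals
estat esize

* Alternative: omega-squared effect sizes
estat esize, omega
```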

**estat moran**

This test is only for regressions run on spatial data. It performs the Moran test for spatial correlation, also known as spatial autocorrelation, which is the amount of similarity between regression residuals linked to similar spatial locations. This test operates on the null hypothesis that observations are independent and identically distributed (i.i.d.). Any p-value below 0.05 (or your chosen cut-off) indicates statistically significant spatial correlation in your regression residuals.

**predict**

This command is used to make predictions and look at influence statistics. For example, you can predict fitted values of the dependent variable or, after models with a binary outcome such as logistic regression, find the probabilities of a positive outcome. You can also use this command to calculate the residuals for further analysis.
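A minimal sketch of some common uses after a linear regression, assuming the bundled auto dataset and an illustrative model. The new variable names (yhat, resid, lev, cooks) are hypothetical and chosen only for this example.

```stata
* Load the example data and fit an illustrative linear regression
sysuse auto, clear
regress price mpg weight

* Fitted values of the dependent variable
predict double yhat, xb

* Residuals for further analysis
predict double resid, residuals

* Influence statistics: leverage and Cook's distance
predict double lev, leverage
predict double cooks, cooksd

* Summarise the new variables
summarize yhat resid lev cooks
```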

These are just a few of Stata's many post-estimation commands. The type of regression you run will determine which post-estimations are the most useful to you. For a list of post-estimation commands for your regression you should check the post-estimation help file for that regression. To access, type *help command postestimation*. For example, if you are looking for logistic regression post-estimations, try typing **help logistic postestimation** in the command pane.