
When data do not completely meet the assumptions underlying the analysis of variance and/or when there are outliers or influential data points robust anova procedures can be used.
The most basic robust procedures are to analyze the data using regression with robust standard errors or to use the robust regression command rreg. Regress with the robust option is more appropriate designs with heterogeneity of variance. The rreg command is less useful except in situations with extreme outliers, although there may be other ways of dealing with these kinds of outliers. Of course, it is necessary to correctly code each of the categorical variables to use the regression approach.
One-Factor Designs
In addition to robust regression techniques, the fstar and wtest commands can be use with one-factor designs. Both fstar and wtest are more robust to heteroscedasticity that regular one-way anova.
Consider an example using write as the response variable and prog as the categorical variable.
use http://www.gseis.ucla.edu/courses/data/cr4new, clear
tabstat y, by(a) stat(n mean sd)
Summary for variables: y
by categories of: a
a | N mean sd
---------+------------------------------
1 | 8 3 1.511858
2 | 8 3.5 .9258201
3 | 8 4.25 1.035098
4 | 8 6.25 2.12132
---------+------------------------------
Total | 32 4.25 1.883716
----------------------------------------
histogram y, by(a) normal
anova y a
Number of obs = 32 R-squared = 0.4455
Root MSE = 1.476 Adj R-squared = 0.3860
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 49.00 3 16.3333333 7.50 0.0008
|
a | 49.00 3 16.3333333 7.50 0.0008
|
Residual | 61.00 28 2.17857143
-----------+----------------------------------------------------
Total | 110.00 31 3.5483871
xi: regress y i.a
i.a _Ia_1-4 (naturally coded; _Ia_1 omitted)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 3, 28) = 7.50
Model | 49.00 3 16.3333333 Prob > F = 0.0008
Residual | 61.00 28 2.17857143 R-squared = 0.4455
-------------+------------------------------ Adj R-squared = 0.3860
Total | 110.00 31 3.5483871 Root MSE = 1.476
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ia_2 | .5 .7379992 0.68 0.504 -1.011723 2.011723
_Ia_3 | 1.25 .7379992 1.69 0.101 -.2617229 2.761723
_Ia_4 | 3.25 .7379992 4.40 0.000 1.738277 4.761723
_cons | 3 .5218443 5.75 0.000 1.931051 4.068949
------------------------------------------------------------------------------
test _Ia_2 _Ia_3 _Ia_4
( 1) _Ia_2 = 0.0
( 2) _Ia_3 = 0.0
( 3) _Ia_4 = 0.0
F( 3, 28) = 7.50
Prob > F = 0.0008
xi: regress y i.a, robust /* useful with heterogeneity */
i.prog _Iprog_1-3 (naturally coded; _Iprog_1 omitted)
Regression with robust standard errors Number of obs = 32
F( 3, 28) = 5.02
Prob > F = 0.0066
R-squared = 0.4455
Root MSE = 1.476
------------------------------------------------------------------------------
| Robust
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ia_2 | .5 .6267832 0.80 0.432 -.7839071 1.783907
_Ia_3 | 1.25 .6477985 1.93 0.064 -.076955 2.576955
_Ia_4 | 3.25 .9209855 3.53 0.001 1.363447 5.136553
_cons | 3 .5345225 5.61 0.000 1.90508 4.09492
------------------------------------------------------------------------------
test _Ia_2 _Ia_3 _Ia_4
( 1) _Ia_2 = 0.0
( 2) _Ia_3 = 0.0
( 3) _Ia_4 = 0.0
F( 3, 28) = 5.02
Prob > F = 0.0066
xi: rreg y i.a /* useful with outliers */
i.prog _Iprog_1-3 (naturally coded; _Iprog_1 omitted)
Huber iteration 1: maximum difference in weights = .53333333
Huber iteration 2: maximum difference in weights = .11280178
Huber iteration 3: maximum difference in weights = .02450832
Biweight iteration 4: maximum difference in weights = .14902125
Biweight iteration 5: maximum difference in weights = .02171319
Biweight iteration 6: maximum difference in weights = .00926737
Robust regression estimates Number of obs = 32
F( 3, 28) = 7.51
Prob > F = 0.0008
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ia_2 | .6744035 .7076641 0.95 0.349 -.7751807 2.123988
_Ia_3 | 1.404021 .7076641 1.98 0.057 -.0455634 2.853605
_Ia_4 | 3.183657 .7076641 4.50 0.000 1.734072 4.633241
_cons | 2.825597 .5003941 5.65 0.000 1.800586 3.850607
------------------------------------------------------------------------------
test _Ia_2 _Ia_3 _Ia_4
( 1) _Ia_2 = 0.0
( 2) _Ia_3 = 0.0
( 3) _Ia_4 = 0.0
F( 3, 28) = 7.51
Prob > F = 0.0008
fstar y a /* available from ATS via the Internet */
----------------------------------------------------------------------
Dependent Variable is y and Independent Variable is a
Fstar( 3, 19.43) = 7.497, p= 0.0016
----------------------------------------------------------------------
wtest y a /* available from ATS via the Internet */
----------------------------------------------------------------------
Dependent Variable is y and Independent Variable is a
WStat( 3, 15.06) = 4.612, p= 0.0176
----------------------------------------------------------------------
Nonparametric TestWe can also try a nonparametric test, such as the Kruskal-Wallis test. The Kruskal-Wallis is the nonparametric analog of the one-way anova.
kwallis y, by(a)
Test: Equality of populations (Kruskal-Wallis test)
a _Obs _RankSum
1 8 78.50
2 8 104.00
3 8 142.50
4 8 203.00
chi-squared = 12.496 with 3 d.f.
probability = 0.0059
chi-squared with ties = 12.997 with 3 d.f.
probability = 0.0046
Bootstrap Standard Errors
xi: regress y i.a, vce(bootstrap, reps(200))
i.a _Ia_1-4 (naturally coded; _Ia_1 omitted)
(running regress on estimation sample)
Bootstrap replications (200)
Linear regression Number of obs = 32
Replications = 200
Wald chi2(3) = 11.80
Prob > chi2 = 0.0081
R-squared = 0.4455
Adj R-squared = 0.3860
Root MSE = 1.4760
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ia_2 | .5 .6584021 0.76 0.448 -.7904445 1.790444
_Ia_3 | 1.25 .7206891 1.73 0.083 -.1625246 2.662525
_Ia_4 | 3.25 1.00577 3.23 0.001 1.278726 5.221274
_cons | 3 .5637016 5.32 0.000 1.895165 4.104835
------------------------------------------------------------------------------
test _Ia_2 _Ia_3 _Ia_4
( 1) _Ia_2 = 0
( 2) _Ia_3 = 0
( 3) _Ia_4 = 0
chi2( 3) = 11.80
Prob > chi2 = 0.0081
Least Absolute DeviationMinimizes the absolute values of deviations from the median, which is why it is often known as median regression. We will run the example using bootstrap standard errors with 200 replications.
xi: bsqreg y i.a, reps(200)
i.a _Ia_1-4 (naturally coded; _Ia_1 omitted)
(fitting base model)
(bootstrapping)
Median regression, bootstrap(200) SEs Number of obs = 32
Raw sum of deviations 44 (about 4)
Min sum of deviations 32 Pseudo R2 = 0.2727
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ia_2 | 1 .8026182 1.25 0.223 -.6440889 2.644089
_Ia_3 | 1 .7716067 1.30 0.206 -.5805647 2.580565
_Ia_4 | 3 1.075819 2.79 0.009 .7962843 5.203716
_cons | 3 .62749 4.78 0.000 1.714645 4.285355
------------------------------------------------------------------------------
test _Ia_2 _Ia_3 _Ia_4
( 1) _Ia_2 = 0
( 2) _Ia_3 = 0
( 3) _Ia_4 = 0
F( 3, 28) = 2.59
Prob > F = 0.0725
Permutation TestsPermutation tests are based on Monte Carlo simulations and can be used to test the differences among the levels of the independent variable. For each repetition, the values of group variable are randomly permuted, the test statistic is computed, and a count is kept whether this value of the test statistic is more extreme than the observed test statistic.
Stata uses the permute command along with an ado program to perform permutation tests. Here is an example of a program that can be used with one-factor designs. It can be placed in the data directory and should be named panova1.ado.
program define panova1 version 8 args dv grp anova `dv' `grp' endHere is how it is used with cr4new.
permute a "panova1 y a" F=e(F), reps(1000)
command: panova1 y a
statistic: F = e(F)
permute var: a
Monte Carlo permutation statistics Number of obs = 32
Replications = 1000
------------------------------------------------------------------------------
T | T(obs) c n p=c/n SE(p) [95% Conf. Interval]
-------------+----------------------------------------------------------------
F | 7.497268 1 1000 0.0010 0.0010 .0000253 .0055589
------------------------------------------------------------------------------
Note: confidence interval is with respect to p=c/n
Note: c = #{|T| >= |T(obs)|}
Next we will modify the dataset so that the results are not significant and repeat the
permutation tests.
replace y=4 in 29/32
anova y a
Number of obs = 32 R-squared = 0.2220
Root MSE = 1.12401 Adj R-squared = 0.1386
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 10.09375 3 3.36458333 2.66 0.0673
|
a | 10.09375 3 3.36458333 2.66 0.0673
|
Residual | 35.375 28 1.26339286
-----------+----------------------------------------------------
Total | 45.46875 31 1.46673387
permute a "panova1 y a" F=e(F), reps(1000)
command: panova1 y a
statistic: F = e(F)
permute var: a
Monte Carlo permutation statistics Number of obs = 32
Replications = 1000
------------------------------------------------------------------------------
T | T(obs) c n p=c/n SE(p) [95% Conf. Interval]
-------------+----------------------------------------------------------------
F | 2.663133 78 1000 0.0780 0.0085 .0621412 .0963936
------------------------------------------------------------------------------
Note: confidence interval is with respect to p=c/n
Note: c = #{|T| >= |T(obs)|}
Two-Factor DesignsConsider a two-factor design using write as the response variable and female and prog as the categorical independent variables.
use http://www.gseis.ucla.edu/courses/data/hsb2, clear
table female prog, contents(freq mean write sd write) row col
--------------------------------------------------
| type of program
female | general academic vocation Total
----------+---------------------------------------
male | 21 47 23 91
| 49.14286 54.61702 41.82609 50.12088
| 10.36478 8.656622 8.003705 10.30516
|
female | 24 58 27 109
| 53.25 57.58621 50.96296 54.99083
| 8.205248 7.115672 8.341193 8.133716
|
Total | 45 105 50 200
| 51.33333 56.25714 46.76 52.775
| 9.397776 7.943343 9.318754 9.478586
--------------------------------------------------
histogram write, by(female prog) start(30) width(5) normal
)
anova write female prog female*prog
Number of obs = 200 R-squared = 0.2590
Root MSE = 8.26386 Adj R-squared = 0.2399
Source | Partial SS df MS F Prob > F
------------+----------------------------------------------------
Model | 4630.36091 5 926.072182 13.56 0.0000
|
female | 1261.85329 1 1261.85329 18.48 0.0000
prog | 3274.35082 2 1637.17541 23.97 0.0000
female*prog | 325.958189 2 162.979094 2.39 0.0946
|
Residual | 13248.5141 194 68.2913097
------------+----------------------------------------------------
Total | 17878.875 199 89.843593
xi2: regress write s.female*s.prog
s.female _Ifemale_0-1 (naturally coded; _Ifemale_0 omitted)
s.prog _Iprog_1-3 (naturally coded; _Iprog_1 omitted)
s.fem~e*s.prog _IfemXpro_#_# (coded as above)
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 5, 194) = 13.56
Model | 4630.36091 5 926.072182 Prob > F = 0.0000
Residual | 13248.5141 194 68.2913097 R-squared = 0.2590
-------------+------------------------------ Adj R-squared = 0.2399
Total | 17878.875 199 89.843593 Root MSE = 8.2639
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ifemale_1 | 5.404401 1.257262 4.30 0.000 2.924744 7.884059
_Iprog_2 | 4.905186 1.477149 3.32 0.001 1.991852 7.818519
_Iprog_3 | -4.801904 1.70264 -2.82 0.005 -8.159965 -1.443842
_IfemXpro_~2 | -1.137957 2.954299 -0.39 0.701 -6.964625 4.68871
_IfemXpro_~3 | 5.029733 3.40528 1.48 0.141 -1.68639 11.74586
_cons | 51.23086 .6286312 81.50 0.000 49.99103 52.47068
------------------------------------------------------------------------------
test _Ifemale_1
( 1) _Ifemale_1 = 0.0
F( 1, 194) = 18.48
Prob > F = 0.0000
test _Iprog_2 _Iprog_3
( 1) _Iprog_2 = 0.0
( 2) _Iprog_3 = 0.0
F( 2, 194) = 23.97
Prob > F = 0.0000
test _IfemXpro_1_2 _IfemXpro_1_3
( 1) _IfemXpro_1_2 = 0.0
( 2) _IfemXpro_1_3 = 0.0
F( 2, 194) = 2.39
Prob > F = 0.0946
xi2: regress write s.female*s.prog, robust
s.female _Ifemale_0-1 (naturally coded; _Ifemale_0 omitted)
s.prog _Iprog_1-3 (naturally coded; _Iprog_1 omitted)
s.fem~e*s.prog _IfemXpro_#_# (coded as above)
Regression with robust standard errors Number of obs = 200
F( 5, 194) = 15.07
Prob > F = 0.0000
R-squared = 0.2590
Root MSE = 8.2639
------------------------------------------------------------------------------
| Robust
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ifemale_1 | 5.404401 1.316228 4.11 0.000 2.808448 8.000355
_Iprog_2 | 4.905186 1.603702 3.06 0.003 1.742255 8.068116
_Iprog_3 | -4.801904 1.80962 -2.65 0.009 -8.370959 -1.232849
_IfemXpro_~2 | -1.137957 3.207405 -0.35 0.723 -7.463818 5.187903
_IfemXpro_~3 | 5.029733 3.61924 1.39 0.166 -2.108377 12.16784
_cons | 51.23086 .6581141 77.84 0.000 49.93288 52.52883
------------------------------------------------------------------------------
test _Ifemale_1
( 1) _Ifemale_1 = 0.0
F( 1, 194) = 16.86
Prob > F = 0.0000
test _Iprog_2 _Iprog_3
( 1) _Iprog_2 = 0.0
( 2) _Iprog_3 = 0.0
F( 2, 194) = 24.85
Prob > F = 0.0000
test _IfemXpro_1_2 _IfemXpro_1_3
( 1) _IfemXpro_1_2 = 0.0
( 2) _IfemXpro_1_3 = 0.0
F( 2, 194) = 2.48
Prob > F = 0.0867
xi2: rreg write s.female*s.prog
s.female _Ifemale_0-1 (naturally coded; _Ifemale_0 omitted)
s.prog _Iprog_1-3 (naturally coded; _Iprog_1 omitted)
s.fem~e*s.prog _IfemXpro_#_# (coded as above)
Huber iteration 1: maximum difference in weights = .43221816
Huber iteration 2: maximum difference in weights = .0522718
Huber iteration 3: maximum difference in weights = .02118098
Biweight iteration 4: maximum difference in weights = .15802072
Biweight iteration 5: maximum difference in weights = .00897314
Robust regression estimates Number of obs = 200
F( 5, 194) = 14.17
Prob > F = 0.0000
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ifemale_1 | 5.555252 1.304384 4.26 0.000 2.982657 8.127846
_Iprog_2 | 5.118503 1.532513 3.34 0.001 2.095978 8.141028
_Iprog_3 | -5.315333 1.766455 -3.01 0.003 -8.799254 -1.831412
_IfemXpro_~2 | -1.307885 3.065025 -0.43 0.670 -7.352935 4.737164
_IfemXpro_~3 | 5.590185 3.532909 1.58 0.115 -1.377657 12.55803
_cons | 51.42093 .6521921 78.84 0.000 50.13464 52.70723
------------------------------------------------------------------------------
test _Ifemale_1
( 1) _Ifemale_1 = 0.0
F( 1, 194) = 18.14
Prob > F = 0.0000
test _Iprog_2 _Iprog_3
( 1) _Iprog_2 = 0.0
( 2) _Iprog_3 = 0.0
F( 2, 194) = 25.60
Prob > F = 0.0000
test _IfemXpro_1_2 _IfemXpro_1_3
( 1) _IfemXpro_1_2 = 0.0
( 2) _IfemXpro_1_3 = 0.0
F( 2, 194) = 2.77
Prob > F = 0.0652
Linear Statistical Models Course
Phil Ender, 15may06, 4apr06, 2apr02