
This unit will cover a number of Stata commands that you have not seen before. Do not panic, this unit is primarily conceptual in nature. You do not have to learn all of the different procedures.
We begin with a fairly typical OLS regression analysis regressing api04 on meals, el, avg_ed and emer. This dataset has complete data on 4,702 schools from 834 school districts.
use http://www.gseis.ucla.edu/courses/data/api04, clear
keep if stype==5 /* elementary schools only */
regress api04 meals el avg_ed emer
Source | SS df MS Number of obs = 4702
-------------+------------------------------ F( 4, 4697) = 3106.99
Model | 30953004.9 4 7738251.22 Prob > F = 0.0000
Residual | 11698313.1 4697 2490.59253 R-squared = 0.7257
-------------+------------------------------ Adj R-squared = 0.7255
Total | 42651318 4701 9072.81812 Root MSE = 49.906
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.393737 .0484685 -28.76 0.000 -1.488758 -1.298716
el | .2083527 .0501284 4.16 0.000 .1100775 .3066279
avg_ed | 57.33541 1.885822 30.40 0.000 53.63831 61.03251
emer | -1.488105 .127398 -11.68 0.000 -1.737864 -1.238345
_cons | 652.3525 7.342538 88.85 0.000 637.9577 666.7473
------------------------------------------------------------------------------
keep if e(sample)
(199 observations deleted)
The problem with this analysis
is that we are treating these data as if the api scores were completely independent
from one school to the next. This is unlikely to be true since scores within school districts
are likely to be similar to one another.We can see how much of the variability is within district versus how much is between district by computing an intraclass correlation using the loneway command in Stata.
loneway api04 dnum
One-way Analysis of Variance for api04:
Number of obs = 4702
R-squared = 0.5586
Source SS df MS F Prob > F
-------------------------------------------------------------------------
Between dnum 23823906 833 28600.127 5.88 0.0000
Within dnum 18827412 3868 4867.4798
-------------------------------------------------------------------------
Total 42651318 4701 9072.8181
Intraclass Asy.
correlation S.E. [95% Conf. Interval]
------------------------------------------------
0.46603 0.03590 0.39565 0.53640
Estimated SD of dnum effect 65.17749
Estimated SD within dnum 69.76733
Est. reliability of a dnum mean 0.82981
(evaluated at n=5.59)
The intraclass correlation of .47 is fairly substantial and puts into question whether the
coefficients and standard errors from our original regression model are correct. The first
thing that we can try is to rerun the analysis using the cluster option. The cluster
option yields the same regression coefficients but allows for differences in the
variance/standard errors due to arbitrary intra-group correlation.
regress api04 meals el avg_ed emer, cluster(dnum)
Regression with robust standard errors Number of obs = 4702
F( 4, 833) = 472.33
Prob > F = 0.0000
R-squared = 0.7257
Number of clusters (dnum) = 834 Root MSE = 49.906
------------------------------------------------------------------------------
| Robust
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.393737 .132543 -10.52 0.000 -1.653895 -1.13358
el | .2083527 .109512 1.90 0.057 -.0065991 .4233045
avg_ed | 57.33541 5.590804 10.26 0.000 46.36169 68.30913
emer | -1.488105 .2979344 -4.99 0.000 -2.072895 -.9033141
_cons | 652.3525 22.19378 29.39 0.000 608.7902 695.9148
------------------------------------------------------------------------------
Note that the variable el is nolonger significant at the .05 level.The analysis using the cluster option works but it is kind a quick-and-dirty solution that would benefit from a more precise solution.
An alternative to using the cluster option is to include dummy coded variables for school district. The advantage of dummy coding district is that it allows for differences in the average level of across across districts in addition to adjusting the standard errors taking into account the specific intra-group correlation. However, regression with 833 dummy variables for school districts is both slow and memory intensive (it requires Stata SE). The alternative is to use the areg command which is logicaly equivalent to the dummy variable approach.
areg api04 meals el avg_ed emer, absorb(dnum)
Number of obs = 4702
F( 4, 3864) = 1802.16
Prob > F = 0.0000
R-squared = 0.8460
Adj R-squared = 0.8126
Root MSE = 41.235
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.694456 .0609289 -27.81 0.000 -1.813912 -1.575
el | -.1155478 .0669 -1.73 0.084 -.2467105 .0156149
avg_ed | 37.07292 2.302082 16.10 0.000 32.55951 41.58633
emer | -1.613581 .1441918 -11.19 0.000 -1.896281 -1.330882
_cons | 733.0973 8.945856 81.95 0.000 715.5582 750.6363
-------------+----------------------------------------------------------------
dnum | F(833, 3864) = 3.621 0.000 (834 categories)
In this analysis both the coefficients and the standard errors differ from the original
regression model.It is also possible to run the areg coomand with the robust option.
areg api04 meals el avg_ed emer, absorb(dnum) robust
Regression with robust standard errors Number of obs = 4702
F( 4, 3864) = 1373.77
Prob > F = 0.0000
R-squared = 0.8460
Adj R-squared = 0.8126
Root MSE = 41.235
------------------------------------------------------------------------------
| Robust
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.694456 .1059713 -15.99 0.000 -1.902221 -1.486691
el | -.1155478 .1199678 -0.96 0.336 -.3507539 .1196584
avg_ed | 37.07292 3.962098 9.36 0.000 29.30492 44.84092
emer | -1.613581 .2141023 -7.54 0.000 -2.033346 -1.193817
_cons | 733.0973 15.63896 46.88 0.000 702.4359 763.7587
-------------+----------------------------------------------------------------
dnum | absorbed (834 categories)
As you can see the coefficients are the same as for the first areg model but the standard
errors have been adjusted.We will follow this analysis with fixed-effects (within) cross-sectional time-series model using xtreg.
xtreg api04 meals el avg_ed emer, i(dnum) fe
Fixed-effects (within) regression Number of obs = 4702
Group variable (i): dnum Number of groups = 834
R-sq: within = 0.6510 Obs per group: min = 1
between = 0.6987 avg = 5.6
overall = 0.7172 max = 373
F(4,3864) = 1802.16
corr(u_i, Xb) = -0.0320 Prob > F = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.694456 .0609289 -27.81 0.000 -1.813912 -1.575
el | -.1155478 .0669 -1.73 0.084 -.2467105 .0156149
avg_ed | 37.07292 2.302082 16.10 0.000 32.55951 41.58633
emer | -1.613581 .1441918 -11.19 0.000 -1.896281 -1.330882
_cons | 733.0973 8.945856 81.95 0.000 715.5582 750.6363
-------------+----------------------------------------------------------------
sigma_u | 47.038086
sigma_e | 41.235374
rho | .5654528 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(833, 3864) = 3.62 Prob > F = 0.0000
This analysis is exactly the same as the areg without the cluster option.We will follow this up with a between-effects xtreg model.
xtreg api04 meals el avg_ed emer, i(dnum) be
Between regression (regression on group means) Number of obs = 4702
Group variable (i): dnum Number of groups = 834
R-sq: within = 0.6419 Obs per group: min = 1
between = 0.7072 avg = 5.6
overall = 0.7203 max = 373
F(4,829) = 500.49
sd(u_i + avg(e_i.))= 46.47706 Prob > F = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.418722 .1109574 -12.79 0.000 -1.636512 -1.200931
el | .157309 .1109758 1.42 0.157 -.0605176 .3751356
avg_ed | 50.64735 4.416476 11.47 0.000 41.97856 59.31614
emer | -2.572258 .2793599 -9.21 0.000 -3.120594 -2.023922
_cons | 668.0286 17.02028 39.25 0.000 634.6206 701.4365
------------------------------------------------------------------------------
The regression coefficients, standard errors and the R-squared between can also be obtained by
generating a mean score for each variable for each district and then running an OLS regression
with one observation per district.
sort dnum id
by dnum: generate i = _n
egen a = mean(api04), by(dnum)
egen b = mean(meals), by(dnum)
egen c = mean(el), by(dnum)
egen d = mean(avg_ed), by(dnum)
egen e = mean(emer), by(dnum)
regress a b c d e if i==1
Source | SS df MS Number of obs = 834
-------------+------------------------------ F( 4, 829) = 500.49
Model | 4324441.91 4 1081110.48 Prob > F = 0.0000
Residual | 1790736.82 829 2160.11679 R-squared = 0.7072
-------------+------------------------------ Adj R-squared = 0.7058
Total | 6115178.73 833 7341.15094 Root MSE = 46.477
------------------------------------------------------------------------------
a | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
b | -1.418722 .1109574 -12.79 0.000 -1.636512 -1.200931
c | .157309 .1109758 1.42 0.157 -.0605177 .3751356
d | 50.64735 4.416476 11.47 0.000 41.97856 59.31614
e | -2.572258 .2793599 -9.21 0.000 -3.120594 -2.023922
_cons | 668.0286 17.02028 39.25 0.000 634.6206 701.4365
------------------------------------------------------------------------------
Since we have the mean values for each district let's subtract the district mean and add back in
the grand mean and then do OLS regression with these adjusted scores.
sum api04, meanonly
generate a2 = api04 - a + r(mean)
sum meals, meanonly
generate b2 = meals - b + r(mean)
sum el, meanonly
generate c2 = el - c + r(mean)
sum avg_ed, meanonly
generate d2 = avg_ed - d + r(mean)
sum emer, meanonly
generate e2 = emer - e + r(mean)
regress a2 b2 c2 d2 e2
Source | SS df MS Number of obs = 4702
-------------+------------------------------ F( 4, 4697) = 2190.67
Model | 12257236 4 3064308.99 Prob > F = 0.0000
Residual | 6570175.85 4697 1398.80261 R-squared = 0.6510
-------------+------------------------------ Adj R-squared = 0.6507
Total | 18827411.8 4701 4004.98018 Root MSE = 37.401
------------------------------------------------------------------------------
a3 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
b3 | -1.694456 .0552626 -30.66 0.000 -1.802797 -1.586115
c3 | -.1155478 .0606784 -1.90 0.057 -.234506 .0034105
d3 | 37.07292 2.087993 17.76 0.000 32.97947 41.16636
e3 | -1.613581 .1307822 -12.34 0.000 -1.869976 -1.357187
_cons | 733.0973 8.11391 90.35 0.000 717.1902 749.0043
This analysis gives us the same coefficients as the areg and fixed-effects xtreg but
not the same standard errors. We also get the same R-squared as within for the fixed-effect
model. The reason that we don't get the same standard errors is that we haven't adjusted
the degrees of freedom of the variances and covariances to account for the 834
district means we have estimated.Next, we will run a random-effects xtreg model. The random-effects model provides a gls solution giving a matrix weighted average of the between-effects and within-effects models.
xtreg api04 meals el avg_ed emer, i(dnum) re
Random-effects GLS regression Number of obs = 4702
Group variable (i): dnum Number of groups = 834
R-sq: within = 0.6506 Obs per group: min = 1
between = 0.7019 avg = 5.6
overall = 0.7197 max = 373
Random effects u_i ~ Gaussian Wald chi2(4) = 9457.09
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.634401 .0530331 -30.82 0.000 -1.738344 -1.530458
el | -.0509669 .0564293 -0.90 0.366 -.1615662 .0596325
avg_ed | 41.01061 2.031122 20.19 0.000 37.02969 44.99154
emer | -1.784759 .1288192 -13.85 0.000 -2.03724 -1.532278
_cons | 709.7991 7.898683 89.86 0.000 694.318 725.2803
-------------+----------------------------------------------------------------
sigma_u | 34.55538
sigma_e | 41.235374
rho | .41254207 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Continuing with the random-effects model, let's try a maximum likelihood solution.
xtreg api04 meals el avg_ed emer, i(dnum) mle nolog
Random-effects ML regression Number of obs = 4702
Group variable (i): dnum Number of groups = 834
Random effects u_i ~ Gaussian Obs per group: min = 1
avg = 5.6
max = 373
LR chi2(4) = 5174.63
Log likelihood = -24619.092 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.623472 .052821 -30.74 0.000 -1.726999 -1.519944
el | -.0407596 .0559112 -0.73 0.466 -.1503435 .0688243
avg_ed | 41.7457 2.037004 20.49 0.000 37.75324 45.73815
emer | -1.805145 .128828 -14.01 0.000 -2.057643 -1.552647
_cons | 707.5397 7.888927 89.69 0.000 692.0777 723.0017
-------------+----------------------------------------------------------------
/sigma_u | 30.15691 1.387128 21.74 0.000 27.43819 32.87563
/sigma_e | 41.61433 .4750977 87.59 0.000 40.68315 42.5455
-------------+----------------------------------------------------------------
rho | .3443291 .0224527 .3014606 .389298
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)= 871.45 Prob>=chibar2 = 0.000
And next, a population averaged GEE random-effects model.
xtreg api04 meals el avg_ed emer, i(dnum) pa nolog
/* same as: xtgee api04 meals el avg_ed emer, i(dnum) nolog */
GEE population-averaged model Number of obs = 4702
Group variable: dnum Number of groups = 834
Link: identity Obs per group: min = 1
Family: Gaussian avg = 5.6
Correlation: exchangeable max = 373
Wald chi2(4) = 14419.82
Scale parameter: 2643.464 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.661276 .0428127 -38.80 0.000 -1.745187 -1.577365
el | -.0781737 .046122 -1.69 0.090 -.168571 .0122237
avg_ed | 39.2176 1.632392 24.02 0.000 36.01818 42.41703
emer | -1.721368 .1026995 -16.76 0.000 -1.922656 -1.520081
_cons | 715.2624 6.407773 111.62 0.000 702.7034 727.8214
------------------------------------------------------------------------------
We can also run the population averaged model with the robust option.
xtreg api04 meals el avg_ed emer, i(dnum) pa robust nolog
/* same as: xtgee api04 meals el avg_ed emer, i(dnum) robust nolog */
GEE population-averaged model Number of obs = 4702
Group variable: dnum Number of groups = 834
Link: identity Obs per group: min = 1
Family: Gaussian avg = 5.6
Correlation: exchangeable max = 373
Wald chi2(4) = 4064.26
Scale parameter: 2643.464 Prob > chi2 = 0.0000
(standard errors adjusted for clustering on dnum)
------------------------------------------------------------------------------
| Semi-robust
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.661276 .1999276 -8.31 0.000 -2.053127 -1.269425
el | -.0781737 .1372211 -0.57 0.569 -.347122 .1907747
avg_ed | 39.2176 10.44393 3.76 0.000 18.74787 59.68734
emer | -1.721368 .2070774 -8.31 0.000 -2.127233 -1.315504
_cons | 715.2624 40.04044 17.86 0.000 636.7846 793.7402
------------------------------------------------------------------------------
Finally, just for the fun of it we will use svyreg with clusting even though the data
are not from a complex survey sampling design.
svyset, psu(dnum) psu is dnum svyreg api04 meals el avg_ed emer Survey linear regression pweight:This analysis is the same as the OLS regression with the cluster option.Number of obs = 4702 Strata: Number of strata = 1 PSU: dnum Number of PSUs = 834 Population size = 4702 F( 4, 830) = 471.03 Prob > F = 0.0000 R-squared = 0.7257 ------------------------------------------------------------------------------ api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- meals | -1.393737 .1324866 -10.52 0.000 -1.653784 -1.133691 el | .2083527 .1094653 1.90 0.057 -.0065076 .423213 avg_ed | 57.33541 5.588425 10.26 0.000 46.36636 68.30446 emer | -1.488105 .2978076 -5.00 0.000 -2.072646 -.903563 _cons | 652.3525 22.18434 29.41 0.000 608.8087 695.8963 ------------------------------------------------------------------------------
Collectively, these analyses provide a range of options for analyzing clustered data in Stata. There is no need to use a multilevel data analysis program for these data since all of the data are collected at the school level and no cross level hypotheses are being tested. Results identical to xtreg with the mle option were obtained using SAS proc mixed.
Phil Ender, 11nov04