Ed230B/C

Regression with Clustered Data


This unit will cover a number of Stata commands that you have not seen before. Do not panic, this unit is primarily conceptual in nature. You do not have to learn all of the different procedures.

We begin with a fairly typical OLS regression analysis regressing api04 on meals, el, avg_ed and emer. This dataset has complete data on 4,702 schools from 834 school districts.

use http://www.gseis.ucla.edu/courses/data/api04, clear

keep if stype==5  /* elementary schools only */

regress api04 meals el avg_ed emer


      Source |       SS       df       MS              Number of obs =    4702
-------------+------------------------------           F(  4,  4697) = 3106.99
       Model |  30953004.9     4  7738251.22           Prob > F      =  0.0000
    Residual |  11698313.1  4697  2490.59253           R-squared     =  0.7257
-------------+------------------------------           Adj R-squared =  0.7255
       Total |    42651318  4701  9072.81812           Root MSE      =  49.906
       
------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.393737   .0484685   -28.76   0.000    -1.488758   -1.298716
          el |   .2083527   .0501284     4.16   0.000     .1100775    .3066279
      avg_ed |   57.33541   1.885822    30.40   0.000     53.63831    61.03251
        emer |  -1.488105    .127398   -11.68   0.000    -1.737864   -1.238345
       _cons |   652.3525   7.342538    88.85   0.000     637.9577    666.7473
------------------------------------------------------------------------------

keep if e(sample)
(199 observations deleted)
The problem with this analysis is that we are treating these data as if the api scores were completely independent from one school to the next. This is unlikely to be true since scores within school districts are likely to be similar to one another.

We can see how much of the variability is within district versus how much is between district by computing an intraclass correlation using the loneway command in Stata.

loneway api04 dnum

                    One-way Analysis of Variance for api04: 

                                              Number of obs =      4702
                                                  R-squared =    0.5586

    Source                SS         df      MS            F     Prob > F
-------------------------------------------------------------------------
Between dnum            23823906    833    28600.127      5.88     0.0000
Within dnum             18827412   3868    4867.4798
-------------------------------------------------------------------------
Total                   42651318   4701    9072.8181

         Intraclass       Asy.        
         correlation      S.E.       [95% Conf. Interval]
         ------------------------------------------------
            0.46603     0.03590       0.39565     0.53640

         Estimated SD of dnum effect             65.17749
         Estimated SD within dnum                69.76733
         Est. reliability of a dnum mean          0.82981
              (evaluated at n=5.59)
The intraclass correlation of .47 is fairly substantial and puts into question whether the coefficients and standard errors from our original regression model are correct. The first thing that we can try is to rerun the analysis using the cluster option. The cluster option yields the same regression coefficients but allows for differences in the variance/standard errors due to arbitrary intra-group correlation.
regress api04 meals el avg_ed emer, cluster(dnum)

Regression with robust standard errors                 Number of obs =    4702
                                                       F(  4,   833) =  472.33
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.7257
Number of clusters (dnum) = 834                        Root MSE      =  49.906

------------------------------------------------------------------------------
             |               Robust
       api04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.393737    .132543   -10.52   0.000    -1.653895    -1.13358
          el |   .2083527    .109512     1.90   0.057    -.0065991    .4233045
      avg_ed |   57.33541   5.590804    10.26   0.000     46.36169    68.30913
        emer |  -1.488105   .2979344    -4.99   0.000    -2.072895   -.9033141
       _cons |   652.3525   22.19378    29.39   0.000     608.7902    695.9148
------------------------------------------------------------------------------
Note that the variable el is nolonger significant at the .05 level.

The analysis using the cluster option works but it is kind a quick-and-dirty solution that would benefit from a more precise solution.

An alternative to using the cluster option is to include dummy coded variables for school district. The advantage of dummy coding district is that it allows for differences in the average level of across across districts in addition to adjusting the standard errors taking into account the specific intra-group correlation. However, regression with 833 dummy variables for school districts is both slow and memory intensive (it requires Stata SE). The alternative is to use the areg command which is logicaly equivalent to the dummy variable approach.

areg api04 meals el avg_ed emer, absorb(dnum)

                                                       Number of obs =    4702
                                                       F(  4,  3864) = 1802.16
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.8460
                                                       Adj R-squared =  0.8126
                                                       Root MSE      =  41.235
------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.694456   .0609289   -27.81   0.000    -1.813912      -1.575
          el |  -.1155478      .0669    -1.73   0.084    -.2467105    .0156149
      avg_ed |   37.07292   2.302082    16.10   0.000     32.55951    41.58633
        emer |  -1.613581   .1441918   -11.19   0.000    -1.896281   -1.330882
       _cons |   733.0973   8.945856    81.95   0.000     715.5582    750.6363
-------------+----------------------------------------------------------------
        dnum |      F(833, 3864) =      3.621   0.000         (834 categories)
In this analysis both the coefficients and the standard errors differ from the original regression model.

It is also possible to run the areg coomand with the robust option.

areg api04 meals el avg_ed emer, absorb(dnum) robust

Regression with robust standard errors                 Number of obs =    4702
                                                       F(  4,  3864) = 1373.77
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.8460
                                                       Adj R-squared =  0.8126
                                                       Root MSE      =  41.235
------------------------------------------------------------------------------
             |               Robust
       api04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.694456   .1059713   -15.99   0.000    -1.902221   -1.486691
          el |  -.1155478   .1199678    -0.96   0.336    -.3507539    .1196584
      avg_ed |   37.07292   3.962098     9.36   0.000     29.30492    44.84092
        emer |  -1.613581   .2141023    -7.54   0.000    -2.033346   -1.193817
       _cons |   733.0973   15.63896    46.88   0.000     702.4359    763.7587
-------------+----------------------------------------------------------------
        dnum |   absorbed                                     (834 categories)
As you can see the coefficients are the same as for the first areg model but the standard errors have been adjusted.

We will follow this analysis with fixed-effects (within) cross-sectional time-series model using xtreg.

xtreg api04 meals el avg_ed emer, i(dnum) fe

Fixed-effects (within) regression               Number of obs      =      4702
Group variable (i): dnum                        Number of groups   =       834

R-sq:  within  = 0.6510                         Obs per group: min =         1
       between = 0.6987                                        avg =       5.6
       overall = 0.7172                                        max =       373

                                                F(4,3864)          =   1802.16
corr(u_i, Xb)  = -0.0320                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.694456   .0609289   -27.81   0.000    -1.813912      -1.575
          el |  -.1155478      .0669    -1.73   0.084    -.2467105    .0156149
      avg_ed |   37.07292   2.302082    16.10   0.000     32.55951    41.58633
        emer |  -1.613581   .1441918   -11.19   0.000    -1.896281   -1.330882
       _cons |   733.0973   8.945856    81.95   0.000     715.5582    750.6363
-------------+----------------------------------------------------------------
     sigma_u |  47.038086
     sigma_e |  41.235374
         rho |   .5654528   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(833, 3864) =     3.62           Prob > F = 0.0000
This analysis is exactly the same as the areg without the cluster option.

We will follow this up with a between-effects xtreg model.

xtreg api04 meals el avg_ed emer, i(dnum) be

Between regression (regression on group means)  Number of obs      =      4702
Group variable (i): dnum                        Number of groups   =       834

R-sq:  within  = 0.6419                         Obs per group: min =         1
       between = 0.7072                                        avg =       5.6
       overall = 0.7203                                        max =       373

                                                F(4,829)           =    500.49
sd(u_i + avg(e_i.))=  46.47706                  Prob > F           =    0.0000

------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.418722   .1109574   -12.79   0.000    -1.636512   -1.200931
          el |    .157309   .1109758     1.42   0.157    -.0605176    .3751356
      avg_ed |   50.64735   4.416476    11.47   0.000     41.97856    59.31614
        emer |  -2.572258   .2793599    -9.21   0.000    -3.120594   -2.023922
       _cons |   668.0286   17.02028    39.25   0.000     634.6206    701.4365
------------------------------------------------------------------------------
The regression coefficients, standard errors and the R-squared between can also be obtained by generating a mean score for each variable for each district and then running an OLS regression with one observation per district.
sort dnum id
by dnum: generate i = _n
egen a = mean(api04), by(dnum)
egen b = mean(meals), by(dnum)
egen c = mean(el), by(dnum)
egen d = mean(avg_ed), by(dnum)
egen e = mean(emer), by(dnum)

regress a b c d e if i==1

      Source |       SS       df       MS              Number of obs =     834
-------------+------------------------------           F(  4,   829) =  500.49
       Model |  4324441.91     4  1081110.48           Prob > F      =  0.0000
    Residual |  1790736.82   829  2160.11679           R-squared     =  0.7072
-------------+------------------------------           Adj R-squared =  0.7058
       Total |  6115178.73   833  7341.15094           Root MSE      =  46.477

------------------------------------------------------------------------------
           a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           b |  -1.418722   .1109574   -12.79   0.000    -1.636512   -1.200931
           c |    .157309   .1109758     1.42   0.157    -.0605177    .3751356
           d |   50.64735   4.416476    11.47   0.000     41.97856    59.31614
           e |  -2.572258   .2793599    -9.21   0.000    -3.120594   -2.023922
       _cons |   668.0286   17.02028    39.25   0.000     634.6206    701.4365
------------------------------------------------------------------------------
Since we have the mean values for each district let's subtract the district mean and add back in the grand mean and then do OLS regression with these adjusted scores.
sum api04, meanonly
generate a2 = api04 - a + r(mean)
sum meals, meanonly
generate b2 = meals - b + r(mean)
sum el, meanonly
generate c2 = el - c + r(mean)
sum avg_ed, meanonly
generate d2 = avg_ed - d + r(mean)
sum emer, meanonly
generate e2 = emer - e + r(mean)

regress a2 b2 c2 d2 e2

      Source |       SS       df       MS              Number of obs =    4702
-------------+------------------------------           F(  4,  4697) = 2190.67
       Model |    12257236     4  3064308.99           Prob > F      =  0.0000
    Residual |  6570175.85  4697  1398.80261           R-squared     =  0.6510
-------------+------------------------------           Adj R-squared =  0.6507
       Total |  18827411.8  4701  4004.98018           Root MSE      =  37.401

------------------------------------------------------------------------------
          a3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          b3 |  -1.694456   .0552626   -30.66   0.000    -1.802797   -1.586115
          c3 |  -.1155478   .0606784    -1.90   0.057     -.234506    .0034105
          d3 |   37.07292   2.087993    17.76   0.000     32.97947    41.16636
          e3 |  -1.613581   .1307822   -12.34   0.000    -1.869976   -1.357187
       _cons |   733.0973    8.11391    90.35   0.000     717.1902    749.0043
This analysis gives us the same coefficients as the areg and fixed-effects xtreg but not the same standard errors. We also get the same R-squared as within for the fixed-effect model. The reason that we don't get the same standard errors is that we haven't adjusted the degrees of freedom of the variances and covariances to account for the 834 district means we have estimated.

Next, we will run a random-effects xtreg model. The random-effects model provides a gls solution giving a matrix weighted average of the between-effects and within-effects models.

xtreg api04 meals el avg_ed emer, i(dnum) re

Random-effects GLS regression                   Number of obs      =      4702
Group variable (i): dnum                        Number of groups   =       834

R-sq:  within  = 0.6506                         Obs per group: min =         1
       between = 0.7019                                        avg =       5.6
       overall = 0.7197                                        max =       373

Random effects u_i ~ Gaussian                   Wald chi2(4)       =   9457.09
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.634401   .0530331   -30.82   0.000    -1.738344   -1.530458
          el |  -.0509669   .0564293    -0.90   0.366    -.1615662    .0596325
      avg_ed |   41.01061   2.031122    20.19   0.000     37.02969    44.99154
        emer |  -1.784759   .1288192   -13.85   0.000     -2.03724   -1.532278
       _cons |   709.7991   7.898683    89.86   0.000      694.318    725.2803
-------------+----------------------------------------------------------------
     sigma_u |   34.55538
     sigma_e |  41.235374
         rho |  .41254207   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Continuing with the random-effects model, let's try a maximum likelihood solution.
xtreg api04 meals el avg_ed emer, i(dnum) mle nolog

Random-effects ML regression                    Number of obs      =      4702
Group variable (i): dnum                        Number of groups   =       834

Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =       5.6
                                                               max =       373

                                                LR chi2(4)         =   5174.63
Log likelihood  = -24619.092                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.623472    .052821   -30.74   0.000    -1.726999   -1.519944
          el |  -.0407596   .0559112    -0.73   0.466    -.1503435    .0688243
      avg_ed |    41.7457   2.037004    20.49   0.000     37.75324    45.73815
        emer |  -1.805145    .128828   -14.01   0.000    -2.057643   -1.552647
       _cons |   707.5397   7.888927    89.69   0.000     692.0777    723.0017
-------------+----------------------------------------------------------------
    /sigma_u |   30.15691   1.387128    21.74   0.000     27.43819    32.87563
    /sigma_e |   41.61433   .4750977    87.59   0.000     40.68315     42.5455
-------------+----------------------------------------------------------------
         rho |   .3443291   .0224527                      .3014606     .389298
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=  871.45 Prob>=chibar2 = 0.000
And next, a population averaged GEE random-effects model.
xtreg api04 meals el avg_ed emer, i(dnum) pa nolog
/* same as: xtgee api04 meals el avg_ed emer, i(dnum) nolog */

GEE population-averaged model                   Number of obs      =      4702
Group variable:                       dnum      Number of groups   =       834
Link:                             identity      Obs per group: min =         1
Family:                           Gaussian                     avg =       5.6
Correlation:                  exchangeable                     max =       373
                                                Wald chi2(4)       =  14419.82
Scale parameter:                  2643.464      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.661276   .0428127   -38.80   0.000    -1.745187   -1.577365
          el |  -.0781737    .046122    -1.69   0.090     -.168571    .0122237
      avg_ed |    39.2176   1.632392    24.02   0.000     36.01818    42.41703
        emer |  -1.721368   .1026995   -16.76   0.000    -1.922656   -1.520081
       _cons |   715.2624   6.407773   111.62   0.000     702.7034    727.8214
------------------------------------------------------------------------------
We can also run the population averaged model with the robust option.
xtreg api04 meals el avg_ed emer, i(dnum) pa robust nolog
/* same as: xtgee api04 meals el avg_ed emer, i(dnum) robust nolog */

GEE population-averaged model                   Number of obs      =      4702
Group variable:                       dnum      Number of groups   =       834
Link:                             identity      Obs per group: min =         1
Family:                           Gaussian                     avg =       5.6
Correlation:                  exchangeable                     max =       373
                                                Wald chi2(4)       =   4064.26
Scale parameter:                  2643.464      Prob > chi2        =    0.0000

                             (standard errors adjusted for clustering on dnum)
------------------------------------------------------------------------------
             |             Semi-robust
       api04 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.661276   .1999276    -8.31   0.000    -2.053127   -1.269425
          el |  -.0781737   .1372211    -0.57   0.569     -.347122    .1907747
      avg_ed |    39.2176   10.44393     3.76   0.000     18.74787    59.68734
        emer |  -1.721368   .2070774    -8.31   0.000    -2.127233   -1.315504
       _cons |   715.2624   40.04044    17.86   0.000     636.7846    793.7402
------------------------------------------------------------------------------
Finally, just for the fun of it we will use svyreg with clusting even though the data are not from a complex survey sampling design.
svyset, psu(dnum)
psu is dnum

svyreg api04 meals el avg_ed emer

Survey linear regression

pweight:                                    Number of obs    =      4702
Strata:                                      Number of strata =         1
PSU:      dnum                                    Number of PSUs   =       834
                                                  Population size  =      4702
                                                  F(   4,    830)  =    471.03
                                                  Prob > F         =    0.0000
                                                  R-squared        =    0.7257

------------------------------------------------------------------------------
       api04 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -1.393737   .1324866   -10.52   0.000    -1.653784   -1.133691
          el |   .2083527   .1094653     1.90   0.057    -.0065076     .423213
      avg_ed |   57.33541   5.588425    10.26   0.000     46.36636    68.30446
        emer |  -1.488105   .2978076    -5.00   0.000    -2.072646    -.903563
       _cons |   652.3525   22.18434    29.41   0.000     608.8087    695.8963
------------------------------------------------------------------------------
This analysis is the same as the OLS regression with the cluster option.

Collectively, these analyses provide a range of options for analyzing clustered data in Stata. There is no need to use a multilevel data analysis program for these data since all of the data are collected at the school level and no cross level hypotheses are being tested. Results identical to xtreg with the mle option were obtained using SAS proc mixed.


UCLA Department of Education

Phil Ender, 11nov04