Ed230B/C

Categorical Predictors


Regression Analysis So Far...

  • Regression analysis so far has involved only continuous variables or, at the least, quasi-interval scaled variables.
  • There are many variables of interest that are categorical in nature.
  • Regression can include the use of categorical variables which, in turn, involves the use of coding to indicate group membership.
  • Such variables go by several names: qualitative regressors, categorical predictor variables, or nominal predictor variables.

    Coding Methods for Categorical Variables

  • Categorical variables require some system for coding observations as to their group membership. At this time we will introduce two methods for coding categorical variables: dummy coding and effect coding.
  • Later, in Analysis of Variance, we will introduce Orthogonal Coding of categorical variables.

    Consider the Following 4 Group Design:

    Level     a1     a2     a3     a4   Total
               1      2      5     10
               3      3      6     10
               2      4      4      9
               2      3      5     11
    Mean     2.0    3.0    5.0   10.0     5.0

    Dummy Coding

  • Use only 1's and 0's.
  • For k groups, use k-1 coded vectors.
  • 1 indicates group membership.
  • Often the control group is coded with all 0's.
  • Each coded column is one degree of freedom.
  • Dummy variables are sometimes called indicator variables.
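    The coding rules above can be sketched in a few lines. This is a minimal illustration in Python rather than Stata; the function name dummy_code and the column names x1, x2, x3 are assumptions made for the example, and Group 4 is taken as the reference group coded with all 0's.

```python
def dummy_code(groups, reference):
    """Return one 0/1 column per non-reference level, in sorted level order."""
    levels = [g for g in sorted(set(groups)) if g != reference]
    return {f"x{level}": [1 if g == level else 0 for g in groups]
            for level in levels}

# Four groups of four observations each, as in the 4 group design.
grp = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
columns = dummy_code(grp, reference=4)   # k = 4 groups -> k - 1 = 3 columns

for name, values in columns.items():
    print(name, values)
```

    Each column carries a 1 only for members of its own group, and the reference group is 0 on every column.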

    Example Using Dummy Coding

     y  grp x1  x2  x3
     1   1  1   0   0 
     3   1  1   0   0
     2   1  1   0   0
     2   1  1   0   0
     2   2  0   1   0
     3   2  0   1   0
     4   2  0   1   0
     3   2  0   1   0
     5   3  0   0   1
     6   3  0   0   1
     4   3  0   0   1
     5   3  0   0   1
    10   4  0   0   0
    10   4  0   0   0
     9   4  0   0   0
    11   4  0   0   0
    

    Regression Analysis Using Dummy Coding

    regress y x1 x2 x3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          x1 |         -8   .5773503    -13.856   0.000      -9.257938   -6.742062
          x2 |         -7   .5773503    -12.124   0.000      -8.257938   -5.742062
          x3 |         -5   .5773503     -8.660   0.000      -6.257938   -3.742062
       _cons |         10   .4082483     24.495   0.000       9.110503     10.8895
    ------------------------------------------------------------------------------
    
    

    Interpretation of Coefficients

  • The constant, b0, is equal to the mean of the group coded with all 0's, i.e., Group 4.
  • b1 is equal to the difference between the mean for Group 1 and Group 4.
  • b2 is equal to the difference between the mean for Group 2 and Group 4.
  • b3 is equal to the difference between the mean for Group 3 and Group 4.
  • The t-test for each coefficient tests the difference between the group coded with 1's and the group coded with all 0's.
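    The interpretation above can be checked by fitting the dummy-coded model directly. The sketch below is plain Python, not Stata's algorithm: it solves the normal equations (X'X)b = X'y with a small Gauss-Jordan routine, and the helper names solve and ols are illustrative.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

y = [1, 3, 2, 2, 2, 3, 4, 3, 5, 6, 4, 5, 10, 10, 9, 11]
grp = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
# Columns: constant, x1, x2, x3 (Group 4 is coded with all 0's).
X = [[1, int(g == 1), int(g == 2), int(g == 3)] for g in grp]

b0, b1, b2, b3 = ols(X, y)
print(round(b0, 6), round(b1, 6), round(b2, 6), round(b3, 6))
```

    Up to floating-point rounding, the coefficients should match the regression output: the constant is the Group 4 mean (10), and b1, b2, b3 are the Group 1, 2, 3 means minus 10.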

    Effect Coding

  • Use only 1's, 0's and -1's.
  • For k groups, use k-1 coded vectors.
  • 1 indicates group membership.
  • Usually the control group is coded with all -1's.
  • Each coded column is one degree of freedom.
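    Effect coding differs from dummy coding only in how the reference group is coded. A minimal Python sketch, assuming Group 4 is the group coded with all -1's; the function name effect_code and the column names are illustrative.

```python
def effect_code(groups, reference):
    """Return one column per non-reference level: 1 for members of that
    level, -1 for the reference group, 0 for everyone else."""
    levels = [g for g in sorted(set(groups)) if g != reference]

    def code(g, level):
        if g == reference:
            return -1
        return 1 if g == level else 0

    return {f"x{level}": [code(g, level) for g in groups] for level in levels}

grp = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
columns = effect_code(grp, reference=4)

for name, values in columns.items():
    print(name, values)
```

    The only change from the dummy-coded columns is that the last four entries of every column are -1 instead of 0.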

    Example Using Effect Coding

     y  grp x1  x2  x3
     1   1  1   0   0 
     3   1  1   0   0
     2   1  1   0   0
     2   1  1   0   0
     2   2  0   1   0
     3   2  0   1   0
     4   2  0   1   0
     3   2  0   1   0
     5   3  0   0   1
     6   3  0   0   1
     4   3  0   0   1
     5   3  0   0   1
    10   4 -1  -1  -1
    10   4 -1  -1  -1
     9   4 -1  -1  -1
    11   4 -1  -1  -1
    

    The Linear Model

    $Y_{ij} = \mu + \alpha_j + \epsilon_{i(j)}$

    where $\alpha_j$ represents the treatment effect of the jth group.

    Regression Analysis Using Effect Coding

    regress y x1 x2 x3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          x1 |         -3   .3535534     -8.485   0.000      -3.770327   -2.229673
          x2 |         -2   .3535534     -5.657   0.000      -2.770327   -1.229673
          x3 |          0   .3535534      0.000   1.000      -.7703266    .7703266
       _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
    ------------------------------------------------------------------------------
    

    Interpretation of Coefficients

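  • The constant is equal to the grand mean of the dependent variable. This holds when the groups are of equal size, as they are here; with unequal group sizes the constant is the unweighted mean of the group means.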
  • b1 is equal to the difference between the mean for Group 1 and the grand mean, i.e., treatment effect for Group 1.
  • b2 is equal to the difference between the mean for Group 2 and the grand mean, i.e., treatment effect for Group 2.
  • b3 is equal to the difference between the mean for Group 3 and the grand mean, i.e., treatment effect for Group 3.
  • The t-test for each coefficient tests whether the treatment effect for that group is significant.
  • The treatment effect for the group coded with -1's is -Σb_k, in this case -(-3 - 2 + 0) = 5.
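    The effect-coded coefficients can be reproduced from the group means alone. A short Python check, assuming the equal group sizes of this example (so the mean of the group means is also the grand mean); the variable names are illustrative.

```python
from statistics import mean

scores = {
    1: [1, 3, 2, 2],
    2: [2, 3, 4, 3],
    3: [5, 6, 4, 5],
    4: [10, 10, 9, 11],
}

group_means = {g: mean(v) for g, v in scores.items()}
grand_mean = mean(list(group_means.values()))   # the constant

# Treatment effect of each group: its mean minus the grand mean.
effects = {g: m - grand_mean for g, m in group_means.items()}

# Effect for the group coded -1, recovered as minus the sum of b1, b2, b3.
implied_effect_4 = -sum(effects[g] for g in (1, 2, 3))

print(grand_mean, effects, implied_effect_4)
```

    The effects for Groups 1 to 3 are the coefficients b1, b2, b3, and the implied effect for Group 4 agrees with its mean minus the grand mean.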

    F-ratio Using R2
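    The overall F-ratio can be computed from R-squared alone. The sketch below assumes the standard relation F = (R²/k) / ((1 - R²)/(N - k - 1)), where k is the number of coded vectors and N is the number of observations; the Python function name f_from_r2 is illustrative. It is applied to the 4 group example, where R-squared = 0.95, k = 3 and N = 16.

```python
def f_from_r2(r2, k, n):
    """Overall F-ratio from R-squared, k coded vectors and n observations."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

f_toy = f_from_r2(0.95, k=3, n=16)
print(round(f_toy, 2))   # compare with F(3, 12) in the regression output
```

    The degrees of freedom are k for the numerator and N - k - 1 for the denominator, which is F(3, 12) here.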

    An example using hsb2

    Let's analyze the hsb2 data for the variable program type (prog) using write as the dependent variable. We will dummy code prog using the tabulate command with the generate option to create the dummy variables for us automatically.

    use http://www.gseis.ucla.edu/courses/data/hsb2
    
    tab prog, gen(prog)
    
        type of |
        program |      Freq.     Percent        Cum.
    ------------+-----------------------------------
        general |         45       22.50       22.50
       academic |        105       52.50       75.00
       vocation |         50       25.00      100.00
    ------------+-----------------------------------
          Total |        200      100.00
    
    regress write prog2 prog3
    
      Source |       SS       df       MS                  Number of obs =     200
    ---------+------------------------------               F(  2,   197) =   21.27
       Model |  3175.69786     2  1587.84893               Prob > F      =  0.0000
    Residual |  14703.1771   197   74.635417               R-squared     =  0.1776
    ---------+------------------------------               Adj R-squared =  0.1693
       Total |   17878.875   199   89.843593               Root MSE      =  8.6392
    
    ------------------------------------------------------------------------------
       write |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
       prog2 |    4.92381   1.539279      3.199   0.002       1.888231    7.959388
       prog3 |  -4.573333   1.775183     -2.576   0.011      -8.074134   -1.072533
       _cons |   51.33333   1.287853     39.860   0.000       48.79359    53.87308
    ------------------------------------------------------------------------------
    
    test prog2 prog3
    
     ( 1)  prog2 = 0.0
     ( 2)  prog3 = 0.0
    
           F(  2,   197) =   21.27
                Prob > F =    0.0000
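    As a check on the output above, the group means and the overall F can be recovered from the printed numbers. A small Python sketch; the coefficients and sums of squares are copied from the regression table, and general (prog1) is the omitted group.

```python
cons, b_prog2, b_prog3 = 51.33333, 4.92381, -4.573333

# With dummy coding, each group mean is the constant plus its coefficient.
means = {
    "general": cons,               # omitted group, coded all 0's
    "academic": cons + b_prog2,
    "vocation": cons + b_prog3,
}

# Overall F from R-squared, with k = 2 coded vectors and N = 200.
ss_model, ss_total = 3175.69786, 17878.875
r2 = ss_model / ss_total
f_ratio = (r2 / 2) / ((1 - r2) / (200 - 2 - 1))

for name, value in means.items():
    print(name, round(value, 3))
print(round(r2, 4), round(f_ratio, 2))
```

    The F computed this way is the same quantity that the test prog2 prog3 command reports, since that test covers every predictor in the model.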

    It is also possible to have Stata perform dummy coding on-the-fly using the xi feature.

    xi: regress write i.prog
    
    i.prog            _Iprog_1-3          (naturally coded; _Iprog_1 omitted)
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  2,   197) =   21.27
           Model |  3175.69786     2  1587.84893           Prob > F      =  0.0000
        Residual |  14703.1771   197   74.635417           R-squared     =  0.1776
    -------------+------------------------------           Adj R-squared =  0.1693
           Total |   17878.875   199   89.843593           Root MSE      =  8.6392
    
    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        _Iprog_2 |    4.92381   1.539279     3.20   0.002     1.888231    7.959388
        _Iprog_3 |  -4.573333   1.775183    -2.58   0.011    -8.074134   -1.072533
           _cons |   51.33333   1.287853    39.86   0.000     48.79359    53.87308
    ------------------------------------------------------------------------------
    
    test _Iprog_2 _Iprog_3
    
     ( 1)  _Iprog_2 = 0.0
     ( 2)  _Iprog_3 = 0.0
    
           F(  2,   197) =   21.27
                Prob > F =    0.0000
    

    Effect coding using xi3

    The xi3 command is available from ATS; use findit xi3 to locate it. The xi3 command works like the built-in xi command but can do more than dummy coding. Use the e. prefix, which stands for effect coding and gives the difference from the grand mean.

    In this example, group one is the reference group, i.e., the group that would be coded -1.

    xi3: regress write e.prog
    e.prog            _Iprog_1-3          (naturally coded; _Iprog_1 omitted)
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  2,   197) =   21.27
           Model |  3175.69786     2  1587.84893           Prob > F      =  0.0000
        Residual |  14703.1771   197   74.635417           R-squared     =  0.1776
    -------------+------------------------------           Adj R-squared =  0.1693
           Total |   17878.875   199   89.843593           Root MSE      =  8.6392
    
    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        _Iprog_2 |   4.806984   .8161241     5.89   0.000     3.197523    6.416445
        _Iprog_3 |  -4.690159   .9626475    -4.87   0.000    -6.588576   -2.791742
           _cons |   51.45016   .6550731    78.54   0.000      50.1583    52.74201
    ------------------------------------------------------------------------------
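    Because the three program types have unequal group sizes, it is worth checking which "grand mean" the effect-coded constant represents. The Python sketch below uses the group means implied by the earlier dummy-coded regression and the frequencies from the tabulate output; the variable names are illustrative.

```python
# Group means implied by the dummy-coded regression of write on prog.
group_means = {
    "general": 51.33333,
    "academic": 51.33333 + 4.92381,
    "vocation": 51.33333 - 4.573333,
}
n = {"general": 45, "academic": 105, "vocation": 50}

# Unweighted mean of the three group means.
unweighted = sum(group_means.values()) / len(group_means)

# Mean of all 200 scores, i.e., the group means weighted by group size.
weighted = sum(group_means[g] * n[g] for g in n) / sum(n.values())

# Effects relative to the unweighted mean.
effects = {g: m - unweighted for g, m in group_means.items()}

print(round(unweighted, 5), round(weighted, 3))
print({g: round(e, 6) for g, e in effects.items()})
```

    The unweighted mean reproduces the constant in the xi3 output, and the academic and vocation effects reproduce _Iprog_2 and _Iprog_3. The effect for general, the group coded -1, is minus the sum of the other two.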

    The ANOVA Alternative

    Many people picture anova software as being good only for classical experimental designs with categorical variables. However, the Stata anova command is actually regression in disguise. Consider the following regression that has both categorical and continuous variables and their interactions.

    use http://www.gseis.ucla.edu/courses/data/hsb2
    
    tabulate prog, gen(prog)
    
        type of |
        program |      Freq.     Percent        Cum.
    ------------+-----------------------------------
        general |         45       22.50       22.50
       academic |        105       52.50       75.00
       vocation |         50       25.00      100.00
    ------------+-----------------------------------
          Total |        200      100.00
    
    generate pXr1 =  prog1* read
    generate pXr2 =  prog2* read
    generate pXm1 =  prog1* math
    generate pXm2 =  prog2* math
    
    regress write read math female prog1 prog2 pXr1 pXr2 pXm1 pXm2
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  9,   190) =   25.80
           Model |  9833.77329     9  1092.64148           Prob > F      =  0.0000
        Residual |  8045.10171   190  42.3426406           R-squared     =  0.5500
    -------------+------------------------------           Adj R-squared =  0.5287
           Total |   17878.875   199   89.843593           Root MSE      =  6.5071
    
    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .2685338   .1174645     2.29   0.023     .0368317    .5002358
            math |   .4798572   .1315465     3.65   0.000      .220378    .7393363
          female |   5.706612   .9390611     6.08   0.000     3.854288    7.558937
           prog1 |   3.126916   9.509236     0.33   0.743    -15.63032    21.88415
           prog2 |   8.999485   7.500741     1.20   0.232    -5.795937    23.79491
            pXr1 |   .2499231   .1666047     1.50   0.135    -.0787094    .5785556
            pXr2 |  -.0612022   .1492444    -0.41   0.682    -.3555909    .2331865
            pXm1 |  -.2725577   .1950709    -1.40   0.164    -.6573405    .1122251
            pXm2 |  -.0662714   .1658823    -0.40   0.690    -.3934789     .260936
           _cons |   8.997199   6.123886     1.47   0.143    -3.082339    21.07674
    ------------------------------------------------------------------------------
    
    test pXm1 pXm2
    
     ( 1)  pXm1 = 0.0
     ( 2)  pXm2 = 0.0
    
           F(  2,   190) =    1.06
                Prob > F =    0.3482
    
    
    test pXr1 pXr2
    
     ( 1)  pXr1 = 0.0
     ( 2)  pXr2 = 0.0
    
           F(  2,   190) =    2.25
                Prob > F =    0.1079
    
    
    test prog1 prog2
    
     ( 1)  prog1 = 0.0
     ( 2)  prog2 = 0.0
    
           F(  2,   190) =    0.78
                Prob > F =    0.4595
    

    Admittedly, that wasn't a very interesting model, but it did illustrate one way to put together all the pieces involved in a model with categorical and interaction terms. Now, let's look at the same model using the anova command.

    anova write read math female prog prog*read prog*math, cont(read math)
    
    
                               Number of obs =     200     R-squared     =  0.5500
                               Root MSE      = 6.50712     Adj R-squared =  0.5287
    
                      Source |  Partial SS    df       MS           F     Prob > F
                  -----------+----------------------------------------------------
                       Model |  9833.77329     9  1092.64148      25.80     0.0000
                             |
                        read |  1170.32031     1  1170.32031      27.64     0.0000
                        math |  1066.81222     1  1066.81222      25.19     0.0000
                      female |  1563.67667     1  1563.67667      36.93     0.0000
                        prog |  66.1182428     2  33.0591214       0.78     0.4595
                   prog*read |  190.783714     2  95.3918572       2.25     0.1079
                   prog*math |  89.8348393     2  44.9174197       1.06     0.3482
                             |
                    Residual |  8045.10171   190  42.3426406   
                  -----------+----------------------------------------------------
                       Total |   17878.875   199   89.843593   
    
    anova, regress
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  9,   190) =   25.80
           Model |  9833.77329     9  1092.64148           Prob > F      =  0.0000
        Residual |  8045.10171   190  42.3426406           R-squared     =  0.5500
    -------------+------------------------------           Adj R-squared =  0.5287
           Total |   17878.875   199   89.843593           Root MSE      =  6.5071
    
    ------------------------------------------------------------------------------
           write        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ------------------------------------------------------------------------------
    _cons            14.70381   6.113001     2.41   0.017     2.645745    26.76188
    read             .2685338   .1174645     2.29   0.023     .0368317    .5002358
    math             .4798572   .1315465     3.65   0.000      .220378    .7393363
    female
               1    -5.706612   .9390611    -6.08   0.000    -7.558937   -3.854288
               2    (dropped)
    prog
               1     3.126916   9.509236     0.33   0.743    -15.63032    21.88415
               2     8.999485   7.500741     1.20   0.232    -5.795937    23.79491
               3    (dropped)
    prog*read
            1        .2499231   .1666047     1.50   0.135    -.0787094    .5785556
            2       -.0612022   .1492444    -0.41   0.682    -.3555909    .2331865
            3       (dropped)
    prog*math
            1       -.2725577   .1950709    -1.40   0.164    -.6573405    .1122251
            2       -.0662714   .1658823    -0.40   0.690    -.3934789     .260936
            3       (dropped)
    ------------------------------------------------------------------------------
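    Two details of the anova, regress output can be reconciled with the earlier regress output by hand. The Python sketch below uses numbers copied from the two tables and assumes that anova treats the last level of female as the reference category, as the "(dropped)" line suggests.

```python
# From the regress output (female entered as a 0/1 variable).
cons_regress = 8.997199
b_female = 5.706612
se_female = 0.9390611

# Switching the reference level of female moves its coefficient into the
# constant and flips the sign of the female coefficient.
cons_anova = cons_regress + b_female
b_female_anova = -b_female

# For a term with 1 degree of freedom, F is the square of the t-ratio.
t_female = b_female / se_female
f_female = t_female ** 2

print(round(cons_anova, 5), b_female_anova, round(f_female, 2))
```

    The constant and the female coefficient agree with the anova, regress table, and the squared t-ratio agrees with the F for female in the ANOVA table.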
    
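    The results are the same as in the regression analysis; however, the set-up of the model to be tested was a little more straightforward. The female coefficient changes sign and the constant shifts only because anova drops the last level of female as the reference category. Let's try one more.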

    ANOVA Example 2

    use http://www.gseis.ucla.edu/courses/data/htwt
    
    anova weight female height female*height, cont(height)
    
                               Number of obs =    1000     R-squared     =  0.2795
                               Root MSE      = 8.20887     Adj R-squared =  0.2773
    
                      Source |  Partial SS    df       MS           F     Prob > F
               --------------+----------------------------------------------------
                       Model |  26034.4351     3  8678.14505     128.78     0.0000
                             |
                      female |  587.074483     1  587.074483       8.71     0.0032
                      height |  19197.3548     1  19197.3548     284.89     0.0000
               female*height |   547.82512     1   547.82512       8.13     0.0044
                             |
                    Residual |  67115.9985   996  67.3855406   
               --------------+----------------------------------------------------
                       Total |  93150.4336   999  93.2436773   
    
    anova, regress
    
          Source |       SS       df       MS              Number of obs =    1000
    -------------+------------------------------           F(  3,   996) =  128.78
           Model |  26034.4351     3  8678.14505           Prob > F      =  0.0000
        Residual |  67115.9985   996  67.3855406           R-squared     =  0.2795
    -------------+------------------------------           Adj R-squared =  0.2773
           Total |  93150.4336   999  93.2436773           Root MSE      =  8.2089
    
    ------------------------------------------------------------------------------
          weight        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ------------------------------------------------------------------------------
    _cons           -33.75054     9.4323    -3.58   0.000       -52.26   -15.24108
    female
               1    -38.26321   12.96338    -2.95   0.003    -63.70188   -12.82455
               2    (dropped)
    height           .5479189   .0582478     9.41   0.000     .4336164    .6622215
    female*height
            1        .2227448   .0781214     2.85   0.004     .0694434    .3760463
            2       (dropped)
    ------------------------------------------------------------------------------
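    With an interaction between a categorical and a continuous variable, each group has its own regression line. The Python sketch below combines the coefficients from the table above into a separate intercept and height slope for each level of female. Levels are labeled 1 and 2 as in the output, with level 2 as the dropped reference level; the dictionary names are illustrative.

```python
cons = -33.75054
b_female1 = -38.26321       # level 1 of female relative to level 2
b_height = 0.5479189        # height slope for the reference level (2)
b_interaction = 0.2227448   # extra height slope for level 1

intercept = {1: cons + b_female1, 2: cons}
slope = {1: b_height + b_interaction, 2: b_height}

for level in (1, 2):
    print(level, round(intercept[level], 5), round(slope[level], 7))
```

    The interaction coefficient is the difference between the two slopes, so its t-test asks whether the relation between height and weight differs across the two levels of female.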
    
    


    UCLA Department of Education

    Phil Ender, 18dec99