Ed230B/C

Coding Categorical Variables


Consider the following four-group design:

Level      a1     a2     a3     a4   Total
            1      2      5     10
            3      3      6     10
            2      4      4      9
            2      3      5     11
Mean      2.0    3.0    5.0   10.0     5.0

xi

xi is built into Stata. It does dummy coding on-the-fly for categorical variables.
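
As a minimal sketch of how xi is typically invoked, the prefix form expands i.grp into indicator variables before running the command. The sketch assumes the variables y and grp from the design above are already in memory. By default xi omits the indicator for the lowest category; the char setting shown is the same one used with xi3 later in these notes.

```stata
* Sketch only: assumes y and grp (coded 1-4) are already in memory.
* By default xi omits the indicator for the lowest value of grp.
xi: regress y i.grp

* Make group 4 the omitted (reference) category instead.
char grp[omit] 4
xi: regress y i.grp
```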

xi3

xi3 is a Stata program, available from ATS over the Internet (type findit xi3 in Stata), that implements a number of different coding systems for categorical variables.

Dummy Coding

  • For k groups, use k-1 coded vectors.
  • Uses only zeros and ones.
  • Reference group is coded with all zeros.
  • Each coded column is one degree of freedom.
  • Constant is the mean of the reference group.
  • Regression coefficients are the differences between each group mean and the reference group mean.

    Dummy coded variables are also known as indicator variables.
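
To make the bullet points concrete, the fitted model and its coefficients can be written out using the group means from the design table (2, 3, 5, and 10), with group 4 as the reference group. The symbols b_0 to b_3 and \bar{y}_j are notation introduced here for illustration.

```latex
\begin{align*}
\hat{y} &= b_0 + b_1 d_1 + b_2 d_2 + b_3 d_3 \\
b_0 &= \bar{y}_4 = 10 \\
b_1 &= \bar{y}_1 - \bar{y}_4 = 2 - 10 = -8 \\
b_2 &= \bar{y}_2 - \bar{y}_4 = 3 - 10 = -7 \\
b_3 &= \bar{y}_3 - \bar{y}_4 = 5 - 10 = -5
\end{align*}
```

These values match the coefficients in the regression output for this example.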

    input y  grp d1  d2  d3
     1   1   1   0   0 
     3   1   1   0   0
     2   1   1   0   0
     2   1   1   0   0
     2   2   0   1   0
     3   2   0   1   0
     4   2   0   1   0
     3   2   0   1   0
     5   3   0   0   1
     6   3   0   0   1
     4   3   0   0   1
     5   3   0   0   1
    10   4   0   0   0
    10   4   0   0   0
     9   4   0   0   0
    11   4   0   0   0
    end
    
    tabstat y, by(grp)
    
    Summary for variables: y
         by categories of: grp 
    
         grp |      mean
    ---------+----------
           1 |         2
           2 |         3
           3 |         5
           4 |        10
    ---------+----------
       Total |         5
    --------------------
    
    regress y d1 d2 d3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          d1 |         -8   .5773503    -13.856   0.000      -9.257938   -6.742062
          d2 |         -7   .5773503    -12.124   0.000      -8.257938   -5.742062
          d3 |         -5   .5773503     -8.660   0.000      -6.257938   -3.742062
       _cons |         10   .4082483     24.495   0.000       9.110503     10.8895
    ------------------------------------------------------------------------------
    
    char grp[omit] 4
    
    . xi3: regress y i.grp
    i.grp             _Igrp_1-4           (naturally coded; _Igrp_4 omitted)
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |      152.00     3  50.6666667           Prob > F      =  0.0000
        Residual |        8.00    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |      160.00    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Igrp_1 |         -8   .5773503   -13.86   0.000    -9.257938   -6.742062
         _Igrp_2 |         -7   .5773503   -12.12   0.000    -8.257938   -5.742062
         _Igrp_3 |         -5   .5773503    -8.66   0.000    -6.257938   -3.742062
           _cons |         10   .4082483    24.49   0.000     9.110503     10.8895
    ------------------------------------------------------------------------------
    
    describe _Igrp_1 _Igrp_2 _Igrp_3
    
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    _Igrp_1         byte   %8.0g                  grp=1
    _Igrp_2         byte   %8.0g                  grp=2
    _Igrp_3         byte   %8.0g                  grp=3
    
    anova y grp
    
                               Number of obs =      16     R-squared     =  0.9500
                               Root MSE      = .816497     Adj R-squared =  0.9375
    
                      Source |  Partial SS    df       MS           F     Prob > F
                  -----------+----------------------------------------------------
                       Model |      152.00     3  50.6666667      76.00     0.0000
                             |
                         grp |      152.00     3  50.6666667      76.00     0.0000
                             |
                    Residual |        8.00    12  .666666667   
                  -----------+----------------------------------------------------
                       Total |      160.00    15  10.6666667   
    
    anova, regress
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |      152.00     3  50.6666667           Prob > F      =  0.0000
        Residual |        8.00    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |      160.00    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ------------------------------------------------------------------------------
    _cons                  10   .4082483    24.49   0.000     9.110503     10.8895
    grp
               1           -8   .5773503   -13.86   0.000    -9.257938   -6.742062
               2           -7   .5773503   -12.12   0.000    -8.257938   -5.742062
               3           -5   .5773503    -8.66   0.000    -6.257938   -3.742062
               4    (dropped)
    ------------------------------------------------------------------------------
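
In Stata 11 and later, factor-variable notation offers another route to the same dummy-coded model without xi or xi3. The sketch below assumes y and grp are in memory as entered above; the stub name ind is arbitrary and chosen only for illustration.

```stata
* Sketch only: assumes y and grp (coded 1-4) are in memory.
* ib4. makes group 4 the base (reference) level.
regress y ib4.grp

* Hand-built indicators for comparison; tabulate creates ind1-ind4.
tabulate grp, generate(ind)
regress y ind1 ind2 ind3
```

Both regressions should reproduce the coefficients shown above (-8, -7, -5, with a constant of 10), since they fit the same model with group 4 as the reference.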

Effect Coding

  • For k groups, use k-1 coded vectors.
  • Uses ones, zeros, and minus ones.
  • Reference group is coded -1.
  • Each coded column is one degree of freedom.
  • Constant is the unweighted grand mean.
  • Regression coefficients are differences between each group mean and the grand mean.

    Effect coding is sometimes known as deviation coding.
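
Using the group means from the design table (2, 3, 5, and 10), the effect-coded coefficients can be worked out by hand. The symbols b_0 to b_3 and \bar{y}_j are notation introduced here; the last line shows how the effect for the group coded -1 is recovered from the others.

```latex
\begin{align*}
b_0 &= \tfrac{1}{4}(\bar{y}_1 + \bar{y}_2 + \bar{y}_3 + \bar{y}_4) = \tfrac{1}{4}(2 + 3 + 5 + 10) = 5 \\
b_1 &= \bar{y}_1 - b_0 = 2 - 5 = -3 \\
b_2 &= \bar{y}_2 - b_0 = 3 - 5 = -2 \\
b_3 &= \bar{y}_3 - b_0 = 5 - 5 = 0 \\
\bar{y}_4 - b_0 &= -(b_1 + b_2 + b_3) = 5
\end{align*}
```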

    input y  grp e1  e2  e3
     1   1   1   0   0 
     3   1   1   0   0
     2   1   1   0   0
     2   1   1   0   0
     2   2   0   1   0
     3   2   0   1   0
     4   2   0   1   0
     3   2   0   1   0
     5   3   0   0   1
     6   3   0   0   1
     4   3   0   0   1
     5   3   0   0   1
    10   4  -1  -1  -1
    10   4  -1  -1  -1
     9   4  -1  -1  -1
    11   4  -1  -1  -1
    end
    
    regress y e1 e2 e3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          e1 |         -3   .3535534     -8.485   0.000      -3.770327   -2.229673
          e2 |         -2   .3535534     -5.657   0.000      -2.770327   -1.229673
          e3 |          0   .3535534      0.000   1.000      -.7703266    .7703266
       _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
    ------------------------------------------------------------------------------
    
    xi3: regress y e.grp
    e.grp             _Igrp_1-4           (naturally coded; _Igrp_4 omitted)
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |      152.00     3  50.6666667           Prob > F      =  0.0000
        Residual |        8.00    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |      160.00    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Igrp_1 |         -3   .3535534    -8.49   0.000    -3.770327   -2.229673
         _Igrp_2 |         -2   .3535534    -5.66   0.000    -2.770327   -1.229673
         _Igrp_3 |   2.36e-16   .3535534     0.00   1.000    -.7703267    .7703267
           _cons |          5   .2041241    24.49   0.000     4.555252    5.444748
    ------------------------------------------------------------------------------
    
    describe _Igrp_1 _Igrp_2 _Igrp_3
    
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    _Igrp_1         double %10.0g                 grp(1 vs. grand mean)
    _Igrp_2         double %10.0g                 grp(2 vs. grand mean)
    _Igrp_3         double %10.0g                 grp(3 vs. grand mean)
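
The effect codes can also be generated from grp rather than typed in, and the effect for the omitted group can be recovered after estimation. This is a sketch under the assumption that y and grp are in memory; the names eff1 to eff3 are illustrative and were chosen to avoid clashing with e1 to e3.

```stata
* Sketch only: assumes y and grp (coded 1-4) are in memory.
* Each vector is 1 for its own group, -1 for group 4, and 0 otherwise.
forvalues j = 1/3 {
    generate eff`j' = (grp == `j') - (grp == 4)
}
regress y eff1 eff2 eff3

* Effect for group 4: minus the sum of the other three effects.
lincom -1*eff1 - 1*eff2 - 1*eff3
```

For these data the lincom estimate should equal the group 4 mean minus the grand mean, 10 - 5 = 5.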

Orthogonal Coding

  • For k groups, use k-1 coded vectors.
  • All vectors are pairwise orthogonal.
  • Constant is the unweighted grand mean.
  • Each coded column is one degree of freedom.
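
For the four-group example that follows, pairwise orthogonality means that the products of the codes for any two vectors sum to zero across groups. Writing the codes for groups 1 to 4 as x_1 = (1, -1, 0, 0), x_2 = (1, 1, -2, 0), and x_3 = (1, 1, 1, -3):

```latex
\begin{align*}
x_1 \cdot x_2 &= (1)(1) + (-1)(1) + (0)(-2) + (0)(0) = 0 \\
x_1 \cdot x_3 &= (1)(1) + (-1)(1) + (0)(1) + (0)(-3) = 0 \\
x_2 \cdot x_3 &= (1)(1) + (1)(1) + (-2)(1) + (0)(-3) = 0
\end{align*}
```

Each vector also sums to zero across groups, which is why the constant equals the unweighted grand mean. With unequal group sizes the same codes would generally no longer be uncorrelated in the data.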

    Example Using Orthogonal Coding

    input y  grp x1  x2  x3
     1   1   1   1   1 
     3   1   1   1   1
     2   1   1   1   1
     2   1   1   1   1
     2   2  -1   1   1
     3   2  -1   1   1
     4   2  -1   1   1
     3   2  -1   1   1
     5   3   0  -2   1
     6   3   0  -2   1
     4   3   0  -2   1
     5   3   0  -2   1
    10   4   0   0  -3
    10   4   0   0  -3
     9   4   0   0  -3
    11   4   0   0  -3
    end
    
    table grp, contents(freq mean y sd y)
    
    ----------------------------------------------
          grp |      Freq.     mean(y)       sd(y)
    ----------+-----------------------------------
            1 |          4           2    .8164966
            2 |          4           3    .8164966
            3 |          4           5    .8164966
            4 |          4          10    .8164966
    ----------------------------------------------
    
    corr x1 x2 x3
    (obs=16)
    
                 |       x1       x2       x3
    -------------+---------------------------
              x1 |   1.0000
              x2 |   0.0000   1.0000
              x3 |   0.0000   0.0000   1.0000
    
    Anova
    
    anova y grp
    
                               Number of obs =      16     R-squared     =  0.9500
                               Root MSE      = .816497     Adj R-squared =  0.9375
    
                      Source |  Partial SS    df       MS           F     Prob > F
                  -----------+----------------------------------------------------
                       Model |      152.00     3  50.6666667      76.00     0.0000
                             |
                         grp |      152.00     3  50.6666667      76.00     0.0000
                             |
                    Residual |        8.00    12  .666666667   
                  -----------+----------------------------------------------------
                       Total |      160.00    15  10.6666667 
    				   
    Regression Analysis Using Orthogonal Coding
    
    regress y x1 x2 x3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          x1 |        -.5   .2886751     -1.732   0.109      -1.128969    .1289691
          x2 |  -.8333333   .1666667     -5.000   0.000      -1.196469   -.4701979
          x3 |  -1.666667   .1178511    -14.142   0.000      -1.923442   -1.409891
       _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
    ------------------------------------------------------------------------------
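
Because the model reproduces the group means exactly and the code vectors are mutually orthogonal and sum to zero, each coefficient can be computed directly from the group means: the contrast among the means divided by the sum of the squared codes. The symbol c_{ij} (the code for group i on vector j) is notation introduced here.

```latex
\begin{align*}
b_j &= \frac{\sum_i c_{ij}\,\bar{y}_i}{\sum_i c_{ij}^2} \\
b_1 &= \frac{2 - 3}{2} = -0.5 \\
b_2 &= \frac{2 + 3 - 2(5)}{6} = -0.8333 \\
b_3 &= \frac{2 + 3 + 5 - 3(10)}{12} = -1.6667
\end{align*}
```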
    

    Orthogonal Coding Schema

    
    Grp X1 X2 X3 X4 X5 X6 X7 X8 X9
     1   1  1  1  1  1  1  1  1  1
     2  -1  1  1  1  1  1  1  1  1
     3   0 -2  1  1  1  1  1  1  1
     4   0  0 -3  1  1  1  1  1  1
     5   0  0  0 -4  1  1  1  1  1
     6   0  0  0  0 -5  1  1  1  1
     7   0  0  0  0  0 -6  1  1  1
     8   0  0  0  0  0  0 -7  1  1
     9   0  0  0  0  0  0  0 -8  1
    10   0  0  0  0  0  0  0  0 -9
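
The schema follows a simple rule: vector Xj codes groups 1 through j as 1, group j+1 as -j, and all later groups as 0. A minimal Stata sketch of that rule is given below. It assumes grp is coded 1 to k with no gaps and no missing values, and that variables named x1, x2, ... do not already exist; the local macro k is set by hand.

```stata
* Sketch only: grp must be coded 1..k; x1, x2, ... must not already exist.
* Observations with missing grp would receive 0, so handle them first.
local k = 4
forvalues j = 2/`k' {
    local c = `j' - 1
    * groups below j get 1, group j gets -(j-1), groups above j get 0
    generate x`c' = cond(grp < `j', 1, cond(grp == `j', -`c', 0))
}
```

With k set to 4 this should give the same x1, x2, and x3 values as the manually entered example; changing k extends the pattern to more groups.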
    

    Orthogonal Coding Using xi3

    We will use the reverse Helmert coding option in our example. Reverse Helmert coding comes closest to the manual orthogonal coding shown above.

    input y  grp
     1   1 
     3   1  
     2   1 
     2   1 
     2   2 
     3   2 
     4   2 
     3   2 
     5   3  
     6   3  
     4   3 
     5   3 
    10   4 
    10   4 
     9   4 
    11   4 
    end
    
    xi3 r.grp
    r.grp             _Igrp_1-4           (naturally coded; _Igrp_1 omitted)
    
    list
                 y        grp    _Igrp_2    _Igrp_3    _Igrp_4
      1.         1          1        -.5  -.3333333       -.25
      2.         3          1        -.5  -.3333333       -.25
      3.         2          1        -.5  -.3333333       -.25
      4.         2          1        -.5  -.3333333       -.25
      5.         2          2         .5  -.3333333       -.25
      6.         3          2         .5  -.3333333       -.25
      7.         4          2         .5  -.3333333       -.25
      8.         3          2         .5  -.3333333       -.25
      9.         5          3          0   .6666667       -.25
     10.         6          3          0   .6666667       -.25
     11.         4          3          0   .6666667       -.25
     12.         5          3          0   .6666667       -.25
     13.        10          4          0          0        .75
     14.        10          4          0          0        .75
     15.         9          4          0          0        .75
     16.        11          4          0          0        .75
    
    corr _Igrp_2 _Igrp_3 _Igrp_4
    (obs=16)
    
                 |  _Igrp_2  _Igrp_3  _Igrp_4
    -------------+---------------------------
         _Igrp_2 |   1.0000
         _Igrp_3 |   0.0000   1.0000
         _Igrp_4 |   0.0000   0.0000   1.0000
    
    regress y _Igrp_2 _Igrp_3 _Igrp_4
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |      152.00     3  50.6666667           Prob > F      =  0.0000
        Residual |        8.00    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |      160.00    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Igrp_2 |          1   .5773503     1.73   0.109    -.2579382    2.257938
         _Igrp_3 |        2.5         .5     5.00   0.000     1.410594    3.589406
         _Igrp_4 |   6.666667   .4714045    14.14   0.000     5.639564    7.693769
           _cons |          5   .2041241    24.49   0.000     4.555252    5.444748
    ------------------------------------------------------------------------------
    
    describe _Igrp_2 _Igrp_3 _Igrp_4
    
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    _Igrp_2         double %10.0g                 grp(2 vs. 1)
    _Igrp_3         double %10.0g                 grp(3 vs. 2-)
    _Igrp_4         double %10.0g                 grp(4 vs. 3-)
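
As the variable labels indicate, each reverse Helmert coefficient compares a group mean with the mean of all preceding groups. With the group means 2, 3, 5, and 10 (the symbols b_2 to b_4 are notation introduced here for the three coefficients):

```latex
\begin{align*}
b_2 &= \bar{y}_2 - \bar{y}_1 = 3 - 2 = 1 \\
b_3 &= \bar{y}_3 - \tfrac{1}{2}(\bar{y}_1 + \bar{y}_2) = 5 - 2.5 = 2.5 \\
b_4 &= \bar{y}_4 - \tfrac{1}{3}(\bar{y}_1 + \bar{y}_2 + \bar{y}_3) = 10 - 3.3333 = 6.6667
\end{align*}
```

The manually coded vectors x1 to x3 test the same comparisons. Their coefficients (-.5, -.8333, -1.6667) are these values with the sign reversed and divided by 2, 3, and 4 respectively, which is why the t statistics agree in magnitude.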

Simple Coding

    Comparing each group with a reference group

  • For k groups, use k-1 coded vectors.
  • Each coded column is one degree of freedom.

    Compare the results of this coding scheme with those of dummy coding.

    input y  grp 
     1   1   
     3   1  
     2   1  
     2   1  
     2   2  
     3   2  
     4   2 
     3   2  
     5   3  
     6   3   
     4   3
     5   3  
    10   4  
    10   4   
     9   4   
    11   4 
    end
    
    xi3 g.grp
    g.grp             _Igrp_1-4           (naturally coded; _Igrp_4 omitted)
    
    describe _Igrp_1 _Igrp_2 _Igrp_3
    
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    _Igrp_1         double %10.0g                 grp(1 vs. 4)
    _Igrp_2         double %10.0g                 grp(2 vs. 4)
    _Igrp_3         double %10.0g                 grp(3 vs. 4)
    
    list _Igrp_1 _Igrp_2 _Igrp_3
    
            _Igrp_1     _Igrp_2     _Igrp_3
      1.        .75        -.25        -.25
      2.        .75        -.25        -.25
      3.        .75        -.25        -.25
      4.        .75        -.25        -.25
      5.       -.25         .75        -.25
      6.       -.25         .75        -.25
      7.       -.25         .75        -.25
      8.       -.25         .75        -.25
      9.       -.25        -.25         .75
     10.       -.25        -.25         .75
     11.       -.25        -.25         .75
     12.       -.25        -.25         .75
     13.       -.25        -.25        -.25
     14.       -.25        -.25        -.25
     15.       -.25        -.25        -.25
     16.       -.25        -.25        -.25
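
The listed values are the dummy codes shifted down by 1/k (here 1/4), so each vector sums to zero across groups. That centering is why the coefficients match those of dummy coding while the constant becomes the unweighted grand mean. The sketch below builds the same codes by hand, assuming y and grp are in memory; the names s1 to s3 are illustrative.

```stata
* Sketch only: s1-s3 are illustrative names, not variables created by xi3.
* Each vector is .75 for its own group and -.25 for every other group.
forvalues j = 1/3 {
    generate s`j' = (grp == `j') - 1/4
}
regress y s1 s2 s3
```

This regression should give coefficients of -8, -7, and -5, as with dummy coding, but a constant of 5 rather than 10.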
    
    xi3: regress y g.grp
    g.grp             _Igrp_1-4           (naturally coded; _Igrp_4 omitted)
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |      152.00     3  50.6666667           Prob > F      =  0.0000
        Residual |        8.00    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |      160.00    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Igrp_1 |         -8   .5773503   -13.86   0.000    -9.257938   -6.742062
         _Igrp_2 |         -7   .5773503   -12.12   0.000    -8.257938   -5.742062
         _Igrp_3 |         -5   .5773503    -8.66   0.000    -6.257938   -3.742062
           _cons |          5   .2041241    24.49   0.000     4.555252    5.444748
    ------------------------------------------------------------------------------

User-Defined Coding

    char grp[user] (1,-1,0,0\0,0,1,-1\.5,.5,-.5,-.5)
    
    xi3: regress y u.grp
    u.grp             _Igrp_1-4           (naturally coded; _Igrp_4 omitted)
    
          Source |       SS       df       MS              Number of obs =      16
    -------------+------------------------------           F(  3,    12) =   76.00
           Model |      152.00     3  50.6666667           Prob > F      =  0.0000
        Residual |        8.00    12  .666666667           R-squared     =  0.9500
    -------------+------------------------------           Adj R-squared =  0.9375
           Total |      160.00    15  10.6666667           Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Igrp_1 |         -1   .5773503    -1.73   0.109    -2.257938    .2579382
         _Igrp_2 |         -5   .5773503    -8.66   0.000    -6.257938   -3.742062
         _Igrp_3 |         -5   .4082483   -12.25   0.000    -5.889497   -4.110503
           _cons |          5   .2041241    24.49   0.000     4.555252    5.444748
    ------------------------------------------------------------------------------
    
    describe _Igrp_1 _Igrp_2 _Igrp_3
    
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    _Igrp_1         double %10.0g                 grp(1 -1 0 0)
    _Igrp_2         double %10.0g                 grp(0 0 1 -1)
    _Igrp_3         double %10.0g                 grp(.5 .5 -.5 -.5)
    
    tablist grp _Igrp_1 _Igrp_2 _Igrp_3
    
          grp     _Igrp_1     _Igrp_2     _Igrp_3   Freq
            1          .5           0          .5      4
            2         -.5           0          .5      4
            3           0          .5         -.5      4
            4           0         -.5         -.5      4
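
Each row of the user matrix defines one comparison among the group means: group 1 versus group 2, group 3 versus group 4, and the average of groups 1 and 2 versus the average of groups 3 and 4. As a cross-check that does not rely on xi3, the same comparisons can be sketched with lincom after a factor-variable regression (Stata 11 or later), assuming y and grp are in memory.

```stata
* Sketch only: group 4 is the base level, so its coefficient is zero
* by construction and 4.grp does not need to appear below.
regress y ib4.grp

* group 1 vs. group 2
lincom 1.grp - 2.grp

* group 3 vs. group 4 (the base level)
lincom 3.grp

* mean of groups 1 and 2 vs. mean of groups 3 and 4
lincom .5*1.grp + .5*2.grp - .5*3.grp
```

For these data the three estimates should be -1, -5, and -5, matching the xi3 coefficients above.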


    Linear Statistical Models Course

    Phil Ender, 21Feb02, 17Mar98