
Regression Analysis So Far...
Coding Methods for Categorical Variables
Consider the Following 4 Group Design:
Level     a1    a2    a3    a4   Total
           1     2     5    10
           3     3     6    10
           2     4     4     9
           2     3     5    11
Mean     2.0   3.0   5.0  10.0     5.0
Dummy Coding
Example Using Dummy Coding
 y  grp  x1  x2  x3
 1   1    1   0   0
 3   1    1   0   0
 2   1    1   0   0
 2   1    1   0   0
 2   2    0   1   0
 3   2    0   1   0
 4   2    0   1   0
 3   2    0   1   0
 5   3    0   0   1
 6   3    0   0   1
 4   3    0   0   1
 5   3    0   0   1
10   4    0   0   0
10   4    0   0   0
 9   4    0   0   0
11   4    0   0   0
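The dummy variables in the listing above can be built from grp rather than typed by hand. The following is a minimal sketch, assuming the sixteen scores are entered with Stata's input command; the data-entry step is an assumption, since the notes do not show how the data were loaded.

```stata
* Enter the 16 scores and their group membership (one observation per line)
clear
input y grp
 1 1
 3 1
 2 1
 2 1
 2 2
 3 2
 4 2
 3 2
 5 3
 6 3
 4 3
 5 3
10 4
10 4
 9 4
11 4
end

* Dummy coding: each x is 1 for its own group and 0 otherwise.
* Group 4 is 0 on all three variables, so it serves as the reference group.
generate x1 = (grp == 1)
generate x2 = (grp == 2)
generate x3 = (grp == 3)

* Display the coded data, with a separator line between groups
list y grp x1 x2 x3, sepby(grp)
```

With k groups only k - 1 dummy variables are needed; a fourth variable for group 4 would be perfectly collinear with the constant and the other three.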
Regression Analysis Using Dummy Coding
regress y x1 x2 x3
Source | SS df MS Number of obs = 16
---------+------------------------------ F( 3, 12) = 76.00
Model | 152.00 3 50.6666667 Prob > F = 0.0000
Residual | 8.00 12 .666666667 R-squared = 0.9500
---------+------------------------------ Adj R-squared = 0.9375
Total | 160.00 15 10.6666667 Root MSE = .8165
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x1 | -8 .5773503 -13.856 0.000 -9.257938 -6.742062
x2 | -7 .5773503 -12.124 0.000 -8.257938 -5.742062
x3 | -5 .5773503 -8.660 0.000 -6.257938 -3.742062
_cons | 10 .4082483 24.495 0.000 9.110503 10.8895
------------------------------------------------------------------------------
Interpretation of Coefficients
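One way to read the coefficients in the dummy-coded regression above is to work them out from the group means in the design table. The symbols $b_0$ to $b_3$ (estimated constant and slopes), $\bar{Y}_j$ (group means), $n_j$ (group sizes) and $MS_{res}$ (residual mean square) are notation introduced here, not taken from the notes.

```latex
\begin{aligned}
b_0 &= \bar{Y}_4 = 10 \\
b_1 &= \bar{Y}_1 - \bar{Y}_4 = 2 - 10 = -8 \\
b_2 &= \bar{Y}_2 - \bar{Y}_4 = 3 - 10 = -7 \\
b_3 &= \bar{Y}_3 - \bar{Y}_4 = 5 - 10 = -5 \\[4pt]
SE(b_j) &= \sqrt{MS_{res}\left(\tfrac{1}{n_j} + \tfrac{1}{n_4}\right)}
         = \sqrt{0.6667\left(\tfrac{1}{4} + \tfrac{1}{4}\right)} \approx 0.5774 \\
SE(b_0) &= \sqrt{MS_{res}/n_4} = \sqrt{0.6667/4} \approx 0.4082
\end{aligned}
```

The constant is the mean of the reference group (group 4, coded 0 on every variable), and each slope is the difference between one group's mean and the reference group's mean. Each t test therefore compares a single group with group 4.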
Effect Coding
Example Using Effect Coding
 y  grp  x1  x2  x3
 1   1    1   0   0
 3   1    1   0   0
 2   1    1   0   0
 2   1    1   0   0
 2   2    0   1   0
 3   2    0   1   0
 4   2    0   1   0
 3   2    0   1   0
 5   3    0   0   1
 6   3    0   0   1
 4   3    0   0   1
 5   3    0   0   1
10   4   -1  -1  -1
10   4   -1  -1  -1
 9   4   -1  -1  -1
11   4   -1  -1  -1
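The effect codes in the listing above differ from dummy codes only in the rows for group 4. The following is a minimal sketch of one way to generate them, again assuming the scores are typed in with input (an assumption; the notes do not show the data-entry step).

```stata
* Enter the 16 scores and their group membership (one observation per line)
clear
input y grp
 1 1
 3 1
 2 1
 2 1
 2 2
 3 2
 4 2
 3 2
 5 3
 6 3
 4 3
 5 3
10 4
10 4
 9 4
11 4
end

* Effect coding: 1 for the variable's own group, -1 for group 4, 0 otherwise.
* (grp == j) evaluates to 1 or 0, so the subtraction yields 1, 0, or -1.
generate x1 = (grp == 1) - (grp == 4)
generate x2 = (grp == 2) - (grp == 4)
generate x3 = (grp == 3) - (grp == 4)

* Display the coded data, with a separator line between groups
list y grp x1 x2 x3, sepby(grp)
```

Because group 4 is coded -1 on all three variables, it has no coefficient of its own; its effect is recovered from the other three.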
The Linear Model
$Y_{ij} = \mu + \alpha_j + \epsilon_{i(j)}$
where $\alpha_j$ represents the treatment effect of the jth group.
Regression Analysis Using Effect Coding
regress y x1 x2 x3
Source | SS df MS Number of obs = 16
---------+------------------------------ F( 3, 12) = 76.00
Model | 152.00 3 50.6666667 Prob > F = 0.0000
Residual | 8.00 12 .666666667 R-squared = 0.9500
---------+------------------------------ Adj R-squared = 0.9375
Total | 160.00 15 10.6666667 Root MSE = .8165
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x1 | -3 .3535534 -8.485 0.000 -3.770327 -2.229673
x2 | -2 .3535534 -5.657 0.000 -2.770327 -1.229673
x3 | 0 .3535534 0.000 1.000 -.7703266 .7703266
_cons | 5 .2041241 24.495 0.000 4.555252 5.444748
------------------------------------------------------------------------------
Interpretation of Coefficients
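The coefficients in the effect-coded regression above can be checked against the group means in the design table. The symbols $b_0$ to $b_3$, $\bar{Y}_j$ and $\bar{Y}_{..}$ (the mean of the four group means) are notation introduced here; the estimates $\hat{\alpha}_j$ refer to the treatment effects in the linear model.

```latex
\begin{aligned}
b_0 &= \bar{Y}_{..} = \tfrac{1}{4}(2 + 3 + 5 + 10) = 5 \\
b_1 &= \hat{\alpha}_1 = \bar{Y}_1 - \bar{Y}_{..} = 2 - 5 = -3 \\
b_2 &= \hat{\alpha}_2 = \bar{Y}_2 - \bar{Y}_{..} = 3 - 5 = -2 \\
b_3 &= \hat{\alpha}_3 = \bar{Y}_3 - \bar{Y}_{..} = 5 - 5 = 0 \\
\hat{\alpha}_4 &= -(b_1 + b_2 + b_3) = -(-3 - 2 + 0) = 5 = \bar{Y}_4 - \bar{Y}_{..}
\end{aligned}
```

The constant is the grand mean and each slope is a group's deviation from it. The effect for group 4, the group coded -1, does not appear in the output but follows from the constraint that the effects sum to zero. The model sums of squares, F, and R-squared are identical to those from dummy coding; only the meaning of the coefficients changes.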
F-ratio Using R2
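The overall F-ratio can be computed from R-squared alone. The worked example below uses the standard formula with the values reported in the regression output above; $k$ (number of predictors) and $N$ (number of observations) are notation introduced here.

```latex
\begin{aligned}
F &= \frac{R^2 / k}{(1 - R^2)/(N - k - 1)} \\[4pt]
  &= \frac{0.95 / 3}{(1 - 0.95)/(16 - 3 - 1)}
   = \frac{0.316667}{0.004167} = 76.00
\end{aligned}
```

This agrees with F(3, 12) = 76.00 in both the dummy-coded and the effect-coded output, since the two codings give the same R-squared.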

An example using hsb2
Let's analyze the hsb2 data for the variable program type (prog) using write as the dependent variable. We will dummy code prog using the tabulate command with the generate option to create the dummy variables for us automatically.
use http://www.gseis.ucla.edu/courses/data/hsb2
tab prog, gen(prog)
type of |
program | Freq. Percent Cum.
------------+-----------------------------------
general | 45 22.50 22.50
academic | 105 52.50 75.00
vocation | 50 25.00 100.00
------------+-----------------------------------
Total | 200 100.00
regress write prog2 prog3
Source | SS df MS Number of obs = 200
---------+------------------------------ F( 2, 197) = 21.27
Model | 3175.69786 2 1587.84893 Prob > F = 0.0000
Residual | 14703.1771 197 74.635417 R-squared = 0.1776
---------+------------------------------ Adj R-squared = 0.1693
Total | 17878.875 199 89.843593 Root MSE = 8.6392
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
prog2 | 4.92381 1.539279 3.199 0.002 1.888231 7.959388
prog3 | -4.573333 1.775183 -2.576 0.011 -8.074134 -1.072533
_cons | 51.33333 1.287853 39.860 0.000 48.79359 53.87308
------------------------------------------------------------------------------
test prog2 prog3
( 1) prog2 = 0.0
( 2) prog3 = 0.0
F( 2, 197) = 21.27
Prob > F = 0.0000
It is also possible to have Stata perform dummy coding on-the-fly using the xi feature.
xi: regress write i.prog
i.prog _Iprog_1-3 (naturally coded; _Iprog_1 omitted)
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 2, 197) = 21.27
Model | 3175.69786 2 1587.84893 Prob > F = 0.0000
Residual | 14703.1771 197 74.635417 R-squared = 0.1776
-------------+------------------------------ Adj R-squared = 0.1693
Total | 17878.875 199 89.843593 Root MSE = 8.6392
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iprog_2 | 4.92381 1.539279 3.20 0.002 1.888231 7.959388
_Iprog_3 | -4.573333 1.775183 -2.58 0.011 -8.074134 -1.072533
_cons | 51.33333 1.287853 39.86 0.000 48.79359 53.87308
------------------------------------------------------------------------------
test _Iprog_2 _Iprog_3
( 1) _Iprog_2 = 0.0
( 2) _Iprog_3 = 0.0
F( 2, 197) = 21.27
Prob > F = 0.0000
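In more recent versions of Stata the same on-the-fly dummy coding is available without the xi prefix. The following is a sketch, assuming Stata 11 or later (where factor-variable notation is built in) and that the hsb2 URL used in these notes is still reachable.

```stata
* Assumes Stata 11 or later; the URL is the one used earlier in these notes
use http://www.gseis.ucla.edu/courses/data/hsb2, clear

* i.prog creates indicator variables for prog; the lowest level (1) is the base
regress write i.prog

* Joint test that all prog indicators are zero (the 2-df test of prog)
testparm i.prog

* ib3. changes the base (reference) category to prog == 3
regress write ib3.prog
```

The first regress should reproduce the coefficients shown above for prog2 and prog3, since prog 1 is the omitted category in both. Changing the base category changes the individual coefficients but not the overall F.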
Effect coding using xi3
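The xi3 command is available from ATS; use findit xi3 to locate and install it. The xi3 command works like the built-in xi command but can do more than dummy coding. Use the e. prefix, which stands for effect coding, to get each group's difference from the grand mean (with unequal group sizes, as here, this is the unweighted mean of the group means).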
In this example group one is the reference group, i.e., the group that would be coded -1.
xi3: regress write e.prog
e.prog _Iprog_1-3 (naturally coded; _Iprog_1 omitted)
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 2, 197) = 21.27
Model | 3175.69786 2 1587.84893 Prob > F = 0.0000
Residual | 14703.1771 197 74.635417 R-squared = 0.1776
-------------+------------------------------ Adj R-squared = 0.1693
Total | 17878.875 199 89.843593 Root MSE = 8.6392
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iprog_2 | 4.806984 .8161241 5.89 0.000 3.197523 6.416445
_Iprog_3 | -4.690159 .9626475 -4.87 0.000 -6.588576 -2.791742
_cons | 51.45016 .6550731 78.54 0.000 50.1583 52.74201
------------------------------------------------------------------------------
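The effect-coded results above can also be checked without xi3 by building the effect codes by hand. The following is a sketch; the variable names e2 and e3 are hypothetical, and it assumes the hsb2 URL used in these notes is still reachable.

```stata
* The URL is the one used earlier in these notes
use http://www.gseis.ucla.edu/courses/data/hsb2, clear

* Effect codes with group 1 (general) as the group coded -1.
* e2 and e3 are hypothetical names chosen for this illustration.
generate e2 = (prog == 2) - (prog == 1)
generate e3 = (prog == 3) - (prog == 1)

* Coefficients are deviations of groups 2 and 3 from the unweighted grand mean
regress write e2 e3

* Group means of write, for comparison with the coefficients
tabulate prog, summarize(write)
```

From the dummy-coded output, the three group means are 51.33333, 56.25714, and 46.76, whose unweighted average is 51.45016, the constant reported above. The coefficients then follow as 56.25714 - 51.45016 = 4.80698 and 46.76 - 51.45016 = -4.69016.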
The ANOVA Alternative
Many people picture anova software as being good only for classical experimental designs with categorical variables. However, the Stata anova command is actually regression in disguise. Consider the following regression that has both categorical and continuous variables and their interactions.
use http://www.gseis.ucla.edu/courses/data/hsb2
tabulate prog, gen(prog)
type of |
program | Freq. Percent Cum.
------------+-----------------------------------
general | 45 22.50 22.50
academic | 105 52.50 75.00
vocation | 50 25.00 100.00
------------+-----------------------------------
Total | 200 100.00
generate pXr1 = prog1* read
generate pXr2 = prog2* read
generate pXm1 = prog1* math
generate pXm2 = prog2* math
regress write read math female prog1 prog2 pXr1 pXr2 pXm1 pXm2
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 9, 190) = 25.80
Model | 9833.77329 9 1092.64148 Prob > F = 0.0000
Residual | 8045.10171 190 42.3426406 R-squared = 0.5500
-------------+------------------------------ Adj R-squared = 0.5287
Total | 17878.875 199 89.843593 Root MSE = 6.5071
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .2685338 .1174645 2.29 0.023 .0368317 .5002358
math | .4798572 .1315465 3.65 0.000 .220378 .7393363
female | 5.706612 .9390611 6.08 0.000 3.854288 7.558937
prog1 | 3.126916 9.509236 0.33 0.743 -15.63032 21.88415
prog2 | 8.999485 7.500741 1.20 0.232 -5.795937 23.79491
pXr1 | .2499231 .1666047 1.50 0.135 -.0787094 .5785556
pXr2 | -.0612022 .1492444 -0.41 0.682 -.3555909 .2331865
pXm1 | -.2725577 .1950709 -1.40 0.164 -.6573405 .1122251
pXm2 | -.0662714 .1658823 -0.40 0.690 -.3934789 .260936
_cons | 8.997199 6.123886 1.47 0.143 -3.082339 21.07674
------------------------------------------------------------------------------
test pXm1 pXm2
( 1) pXm1 = 0.0
( 2) pXm2 = 0.0
F( 2, 190) = 1.06
Prob > F = 0.3482
test pXr1 pXr2
( 1) pXr1 = 0.0
( 2) pXr2 = 0.0
F( 2, 190) = 2.25
Prob > F = 0.1079
test prog1 prog2
( 1) prog1 = 0.0
( 2) prog2 = 0.0
F( 2, 190) = 0.78
Prob > F = 0.4595
Admittedly, that wasn't a very interesting model, but it did illustrate one way to put together all the pieces involved in a model with categorical variables and interaction terms. Now, let's look at the same model using the anova command.
anova write read math female prog prog*read prog*math, cont(read math)
Number of obs = 200 R-squared = 0.5500
Root MSE = 6.50712 Adj R-squared = 0.5287
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 9833.77329 9 1092.64148 25.80 0.0000
|
read | 1170.32031 1 1170.32031 27.64 0.0000
math | 1066.81222 1 1066.81222 25.19 0.0000
female | 1563.67667 1 1563.67667 36.93 0.0000
prog | 66.1182428 2 33.0591214 0.78 0.4595
prog*read | 190.783714 2 95.3918572 2.25 0.1079
prog*math | 89.8348393 2 44.9174197 1.06 0.3482
|
Residual | 8045.10171 190 42.3426406
-----------+----------------------------------------------------
Total | 17878.875 199 89.843593
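The anova syntax shown above is the older form. In Stata 11 and later, continuous covariates are marked with the c. prefix instead of the cont() option, and interactions are written with # instead of *. The following is a sketch of the equivalent call under that assumption; it has not been run against this dataset.

```stata
* Assumes Stata 11 or later and the hsb2 data from these notes
use http://www.gseis.ucla.edu/courses/data/hsb2, clear

* c. marks read and math as continuous; # denotes an interaction
anova write c.read c.math female prog prog#c.read prog#c.math

* Typing regress after anova redisplays the fit as a coefficient table
regress
```

The newer syntax omits the first level of each factor by default rather than the last, so the individual coefficients may be parameterized differently from the table in these notes, while the partial sums of squares and F tests should be unaffected.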
anova, regress
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 9, 190) = 25.80
Model | 9833.77329 9 1092.64148 Prob > F = 0.0000
Residual | 8045.10171 190 42.3426406 R-squared = 0.5500
-------------+------------------------------ Adj R-squared = 0.5287
Total | 17878.875 199 89.843593 Root MSE = 6.5071
------------------------------------------------------------------------------
write Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------------------------------------------------------------------
_cons 14.70381 6.113001 2.41 0.017 2.645745 26.76188
read .2685338 .1174645 2.29 0.023 .0368317 .5002358
math .4798572 .1315465 3.65 0.000 .220378 .7393363
female
1 -5.706612 .9390611 -6.08 0.000 -7.558937 -3.854288
2 (dropped)
prog
1 3.126916 9.509236 0.33 0.743 -15.63032 21.88415
2 8.999485 7.500741 1.20 0.232 -5.795937 23.79491
3 (dropped)
prog*read
1 .2499231 .1666047 1.50 0.135 -.0787094 .5785556
2 -.0612022 .1492444 -0.41 0.682 -.3555909 .2331865
3 (dropped)
prog*math
1 -.2725577 .1950709 -1.40 0.164 -.6573405 .1122251
2 -.0662714 .1658823 -0.40 0.690 -.3934789 .260936
3 (dropped)
------------------------------------------------------------------------------
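Two entries in the table above differ from the earlier regress output: the constant and the sign of the female coefficient. Both follow from anova dropping the last level of female, whereas the regress model used the 0/1 variable female directly. The subscripts below are labels introduced here for the two sets of estimates.

```latex
\begin{aligned}
b_{female,\,anova} &= -b_{female,\,regress} = -5.706612 \\
b_{0,\,anova} &= b_{0,\,regress} + b_{female,\,regress}
              = 8.997199 + 5.706612 = 14.703811 \approx 14.70381
\end{aligned}
```

The remaining coefficients, standard errors, and the overall fit are identical because prog 3 is the omitted category in both parameterizations.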
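The results are the same as in the regression analysis (apart from the choice of reference level for female, which changes the constant and the sign of that coefficient); however, the set-up of the model to be tested was a little more straightforward. Let's try one more.
ANOVA Example 2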
use http://www.gseis.ucla.edu/courses/data/htwt
anova weight female height female*height, cont(height)
Number of obs = 1000 R-squared = 0.2795
Root MSE = 8.20887 Adj R-squared = 0.2773
Source | Partial SS df MS F Prob > F
--------------+----------------------------------------------------
Model | 26034.4351 3 8678.14505 128.78 0.0000
|
female | 587.074483 1 587.074483 8.71 0.0032
height | 19197.3548 1 19197.3548 284.89 0.0000
female*height | 547.82512 1 547.82512 8.13 0.0044
|
Residual | 67115.9985 996 67.3855406
--------------+----------------------------------------------------
Total | 93150.4336 999 93.2436773
anova, regress
Source | SS df MS Number of obs = 1000
-------------+------------------------------ F( 3, 996) = 128.78
Model | 26034.4351 3 8678.14505 Prob > F = 0.0000
Residual | 67115.9985 996 67.3855406 R-squared = 0.2795
-------------+------------------------------ Adj R-squared = 0.2773
Total | 93150.4336 999 93.2436773 Root MSE = 8.2089
------------------------------------------------------------------------------
weight Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------------------------------------------------------------------
_cons -33.75054 9.4323 -3.58 0.000 -52.26 -15.24108
female
1 -38.26321 12.96338 -2.95 0.003 -63.70188 -12.82455
2 (dropped)
height .5479189 .0582478 9.41 0.000 .4336164 .6622215
female*height
1 .2227448 .0781214 2.85 0.004 .0694434 .3760463
2 (dropped)
------------------------------------------------------------------------------
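The coefficient table above implies a separate regression line for each level of female. A worked version follows, where $\widehat{weight}$ denotes the predicted value; levels 1 and 2 are the levels as labeled by anova, since the notes do not say which sex each level represents.

```latex
\begin{aligned}
\text{level 2 (dropped, reference):}\quad
\widehat{weight} &= -33.75054 + 0.5479189\,height \\[4pt]
\text{level 1:}\quad
\widehat{weight} &= (-33.75054 - 38.26321) + (0.5479189 + 0.2227448)\,height \\
                 &= -72.01375 + 0.7706637\,height
\end{aligned}
```

The interaction coefficient, 0.2227448, is the difference between the two slopes. Its t statistic squared, $(0.2227448 / 0.0781214)^2 \approx 8.13$, matches the F for female*height in the anova table.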
Phil Ender, 18dec99