
The effect of scaling predictor variables can be easily demonstrated using the variable read in the hsb2 dataset. We will begin with a model regressing write on female and read.
Example 1
use http://www.ats.ucla.edu/courses/data/hsb2
regress write female read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
The coefficient for read (.57) indicates how much change is expected in write when
there is a one unit increase in read with female held constant. The concern here
is that a one unit change might not be terribly meaningful. Suppose that research has
indicated that a 12 point change in read is meaningful. Here is what you could do.
generate read12 = read/12
regress write female read12
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32128     2  3928.16064           Prob > F      =  0.0000
    Residual |  10022.5537   197  50.8759072           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
      read12 |   6.790643   .5926186    11.46   0.000     5.621953    7.959334
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
Now a one unit change in read12 predicts a 6.8 point change in write with female
held constant. A one point change in read12 is equivalent to a 12 point change in read.
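The same 12 point effect can also be obtained from the original model without creating a new variable, because dividing a predictor by 12 multiplies both its coefficient and its standard error by 12. Here is a minimal sketch using lincom after the first regression; the clear option is added only so that the block can be run on its own.

```stata
* Sketch: recover the effect of a 12 point change in read
* directly from the model that uses read in its original units.
use http://www.ats.ucla.edu/courses/data/hsb2, clear
regress write female read

* lincom reports 12 times the read coefficient together with its
* standard error, t statistic, and confidence interval
lincom 12*read

* the same point estimate computed by hand from the stored coefficient
display 12*_b[read]
```

The estimate from lincom should agree with the read12 row above (12 x .5658869 is approximately 6.7906), up to rounding.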
Note that the standardized coefficients are identical for read and read12.
regress write female read, beta noheader
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
        read |   .5658869   .0493849    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------
regress write female read12, beta noheader
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
      read12 |   6.790643   .5926186    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------
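The Beta column does not change because a standardized coefficient is the raw coefficient multiplied by the ratio of the predictor's standard deviation to the outcome's standard deviation. Dividing read by 12 multiplies the coefficient by 12 and divides the standard deviation by 12, so the product stays the same. The following sketch computes the standardized coefficients by hand; the scalar name sd_write is only an illustrative choice.

```stata
* Sketch: standardized coefficient = b * sd(x) / sd(y)
use http://www.ats.ucla.edu/courses/data/hsb2, clear
generate read12 = read/12

* standard deviation of the outcome, saved for later use
quietly summarize write
scalar sd_write = r(sd)

* original scale: _b[read] times sd(read) over sd(write)
quietly regress write female read
quietly summarize read
display "beta for read:   " _b[read]*r(sd)/sd_write

* rescaled: the coefficient is 12 times larger, the sd is 12 times smaller
quietly regress write female read12
quietly summarize read12
display "beta for read12: " _b[read12]*r(sd)/sd_write
```

Both display lines should reproduce the value in the Beta column for read and read12.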
Example 2
Now, what if read were a categorical variable? We will divide read into five categories. I am not suggesting that you should take a continuous variable and break it up into categories; the point is only to show the effect of scaling read as a categorical variable.
egen readcat = cut(read), group(5) icodes
tabulate readcat
    readcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         39       19.50       19.50
          1 |         16        8.00       27.50
          2 |         62       31.00       58.50
          3 |         37       18.50       77.00
          4 |         46       23.00      100.00
------------+-----------------------------------
      Total |        200      100.00
tabstat read, by(readcat)
Summary for variables: read
     by categories of: readcat
 readcat |      mean
---------+----------
       0 |  38.61538
       1 |     44.25
       2 |  49.22581
       3 |  57.13514
       4 |  66.65217
---------+----------
   Total |     52.23
--------------------
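To see where the category boundaries fall, it can help to list the range of read within each category as well as the mean. This is a small sketch that rebuilds readcat so it can be run on its own. With group(5), egen's cut() function aims for groups of roughly equal frequency, and tied values of read are one likely reason the group sizes here are uneven.

```stata
* Sketch: show the minimum, maximum, and count of read within
* each readcat group to reveal the cut points that were used.
use http://www.ats.ucla.edu/courses/data/hsb2, clear
egen readcat = cut(read), group(5) icodes
tabstat read, by(readcat) statistics(min max n)
```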
Let's run a regression using xi to produce dummy coding on the fly.
xi: regress write female i.readcat
i.readcat         _Ireadcat_0-4       (naturally coded; _Ireadcat_0 omitted)
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  5,   194) =   28.02
       Model |  7497.74329     5  1499.54866           Prob > F      =  0.0000
    Residual |  10381.1317   194  53.5109882           R-squared     =  0.4194
-------------+------------------------------           Adj R-squared =  0.4044
       Total |   17878.875   199   89.843593           Root MSE      =  7.3151
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.714997     1.0442     5.47   0.000     3.655556    7.774438
 _Ireadcat_1 |   2.237243   2.173774     1.03   0.305    -2.050021    6.524508
 _Ireadcat_2 |   6.692244   1.495062     4.48   0.000     3.743581    9.640906
 _Ireadcat_3 |   11.49109   1.680671     6.84   0.000     8.176362    14.80583
 _Ireadcat_4 |   15.76366   1.596531     9.87   0.000     12.61487    18.91244
       _cons |   41.65526   1.323366    31.48   0.000     39.04523    44.26529
------------------------------------------------------------------------------
test _Ireadcat_1 _Ireadcat_2 _Ireadcat_3 _Ireadcat_4
 ( 1)  _Ireadcat_1 = 0.0
 ( 2)  _Ireadcat_2 = 0.0
 ( 3)  _Ireadcat_3 = 0.0
 ( 4)  _Ireadcat_4 = 0.0
       F(  4,   194) =   29.53
            Prob > F =    0.0000
We see that, overall, readcat is a significant predictor of write. The
R2 for this model is .4194, compared with .4394 when read is continuous.
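In Stata 11 or later, the same model and joint test can be written with factor-variable notation instead of the xi prefix. The sketch below assumes such a version is available; it rebuilds readcat so that it runs on its own, and it uses testparm so the indicators do not have to be listed one by one.

```stata
* Sketch: factor-variable version of the dummy-coded model.
* Assumes Stata 11 or later; category 0 is the base (omitted) level by default.
use http://www.ats.ucla.edu/courses/data/hsb2, clear
egen readcat = cut(read), group(5) icodes
regress write female i.readcat

* joint test that all readcat indicators are zero
testparm i.readcat
```

The F statistic from testparm should match the four degree of freedom test shown above.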
Next, let's use readcat in a model but treat it as a one degree of freedom linear
predictor.
regress write female readcat
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   69.99
       Model |  7426.82985     2  3713.41492           Prob > F      =  0.0000
    Residual |  10452.0452   197  53.0560668           R-squared     =  0.4154
-------------+------------------------------           Adj R-squared =  0.4095
       Total |   17878.875   199   89.843593           Root MSE      =   7.284
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.688666   1.037052     5.49   0.000     3.643518    7.733814
     readcat |   4.030212   .3713078    10.85   0.000     3.297964     4.76246
       _cons |   40.90897   1.141635    35.83   0.000     38.65757    43.16036
------------------------------------------------------------------------------
The linear form of readcat is still significant, but the R2 for the model has gone
down to .4154 from .4194, a trivial difference for a gain of three degrees of freedom in the residual (197 versus 194).
We can test whether the difference between using read and readcat is significant by including both in a model. The significant coefficient for read (below) suggests that the continuous form of read accounts for variability in write that is not captured by the categorical form.
xi: regress write female i.readcat read
i.readcat         _Ireadcat_0-4       (naturally coded; _Ireadcat_0 omitted)
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  6,   193) =   25.62
       Model |  7926.37551     6  1321.06258           Prob > F      =  0.0000
    Residual |  9952.49949   193  51.5673549           R-squared     =  0.4433
-------------+------------------------------           Adj R-squared =  0.4260
       Total |   17878.875   199   89.843593           Root MSE      =   7.181
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.469592   1.028589     5.32   0.000     3.440875     7.49831
 _Ireadcat_1 |  -.6631964   2.359184    -0.28   0.779     -5.31629    3.989897
 _Ireadcat_2 |   1.273686   2.384601     0.53   0.594    -3.429538     5.97691
 _Ireadcat_3 |   2.011662   3.678693     0.55   0.585    -5.243941    9.267264
 _Ireadcat_4 |   1.413838   5.218196     0.27   0.787    -8.878174    11.70585
        read |   .5108452    .177188     2.88   0.004     .1613717    .8603187
       _cons |    22.0735    6.91511     3.19   0.002      8.43461    35.71239
------------------------------------------------------------------------------
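The complementary question, whether the categories add anything once the continuous form of read is in the model, can be examined with a joint test of the four indicators. This is only a sketch of how one might run it; no result is claimed here, although none of the individual indicator coefficients above is significant.

```stata
* Sketch: joint test of the readcat indicators with read also in the model.
use http://www.ats.ucla.edu/courses/data/hsb2, clear
egen readcat = cut(read), group(5) icodes
xi: regress write female i.readcat read

* tests _Ireadcat_1 through _Ireadcat_4 simultaneously
testparm _Ireadcat_*
```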
Phil Ender, 22dec00