
Cross Validation
Shrinkage
Estimating Shrinkage

Cross Validation
Double Cross Validation
Stata Cross Validation Example
We begin with the 1999 API data for 71 Orange County high schools. The regression below uses the 67 schools that have complete data on all of the variables.
use http://www.gseis.ucla.edu/courses/data/ochi
regress api99 pctmeal pctel yrrnd core avged pctemer
  Source |          SS    df          MS               Number of obs =      67
---------+------------------------------               F(  6,    60) =   77.36
   Model |  846646.429     6  141107.738               Prob > F      =  0.0000
Residual |  109446.288    60   1824.1048               R-squared     =  0.8855
---------+------------------------------               Adj R-squared =  0.8741
   Total |  956092.716    66  14486.2533               Root MSE      =   42.71
------------------------------------------------------------------------------
   api99 |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 pctmeal |   .6088963   .3348236      1.819   0.074      -.0608506    1.278643
   pctel |  -3.372816   .7189063     -4.692   0.000      -4.810842   -1.934789
   yrrnd |  -30.10476   24.75397     -1.216   0.229      -79.62008    19.41056
    core |   -4.97981   3.339713     -1.491   0.141      -11.66023     1.70061
   avged |   69.37089   15.76158      4.401   0.000       37.84305    100.8987
 pctemer |  -.1026734   .8709507     -0.118   0.907      -1.844834    1.639487
   _cons |   693.2872   105.3493      6.581   0.000       482.5573    904.0171
------------------------------------------------------------------------------
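Now we load the 1999 API data for 236 Los Angeles County high schools. This dataset, lahi, has the same variables as the first dataset, ochi. Because the regression estimates from ochi are still in memory, predict applies the Orange County coefficients to the Los Angeles schools.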
use http://www.gseis.ucla.edu/courses/data/lahi
predict p1
(option xb assumed; fitted values)
(10 missing values generated)
corr api99 p1
(obs=226)
         |    api99       p1
---------+------------------
   api99 |   1.0000
      p1 |   0.8522   1.0000
display "R-squared = ", .8522^2
R-squared = .72624484
Note that the cross-validated R-squared of .7262 is much lower than the R-squared of .8855 from the original regression analysis, a shrinkage of about .16.
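The same comparison can be made from Stata's stored results rather than by retyping rounded values. What follows is a minimal sketch, assuming the two dataset URLs used above are still reachable. The scalar names r2_fit and r2_cv are illustrative and are not part of the original session.

```stata
* Fit the model on the Orange County sample and keep its R-squared.
use http://www.gseis.ucla.edu/courses/data/ochi, clear
regress api99 pctmeal pctel yrrnd core avged pctemer
scalar r2_fit = e(r2)

* Load the Los Angeles sample; the estimates stay in memory,
* so predict uses the Orange County coefficients.
use http://www.gseis.ucla.edu/courses/data/lahi, clear
predict p1
corr api99 p1

* Squared cross-validity correlation and the resulting shrinkage.
scalar r2_cv = r(rho)^2
display "Cross-validated R-squared = " r2_cv
display "Shrinkage = " r2_fit - r2_cv
```

Scalars are not dropped when a new dataset is loaded with the clear option, which is why r2_fit is still available after the second use command.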
Stata Cross Validation Method 2
use http://www.gseis.ucla.edu/courses/data/ochi
count
71
append using http://www.gseis.ucla.edu/courses/data/lahi
(label yn already defined)
count
307
generate sample=1 in 1/71
(236 missing values generated)
replace sample=2 in 72/l
(236 real changes made)
regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1
      Source |          SS    df          MS           Number of obs =      67
-------------+------------------------------           F(  6,    60) =   77.36
       Model |  846646.429     6  141107.738           Prob > F      =  0.0000
    Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
-------------+------------------------------           Adj R-squared =  0.8741
       Total |  956092.716    66  14486.2533           Root MSE      =   42.71
------------------------------------------------------------------------------
       api99 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
       pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
       yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
        core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
       avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
     pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
       _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
------------------------------------------------------------------------------
predict pre
(option xb assumed; fitted values)
(14 missing values generated)
corr api99 pre if sample==2
(obs=226)
             |    api99      pre
-------------+------------------
       api99 |   1.0000
         pre |   0.8522   1.0000
display r(rho)^2
.72620797
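The outline above also lists double cross validation. The following is a minimal sketch, assuming the usual definition in which each sample's regression equation is applied to the other sample and the two squared cross-validity correlations are compared. The variable names pre1 and pre2 are hypothetical, and no output is shown because this run is not part of the original session.

```stata
* Build the combined dataset as in Method 2.
use http://www.gseis.ucla.edu/courses/data/ochi, clear
append using http://www.gseis.ucla.edu/courses/data/lahi
generate sample = 1 in 1/71
replace sample = 2 in 72/l

* Direction 1: Orange County equation applied to Los Angeles schools.
quietly regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1
predict pre1
quietly corr api99 pre1 if sample==2
display "Sample 1 equation in sample 2: R-squared = " r(rho)^2

* Direction 2: Los Angeles equation applied to Orange County schools.
quietly regress api99 pctmeal pctel yrrnd core avged pctemer if sample==2
predict pre2
quietly corr api99 pre2 if sample==1
display "Sample 2 equation in sample 1: R-squared = " r(rho)^2
```

If the two squared correlations are close to each other and to the within-sample R-squared values, the equation is holding up reasonably well in both directions.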
Phil Ender, 29Jan98