Ed230B/C

Winter

Cross Validation


Shrinkage

  • The tendency for regression to bias R2 upward by capitalizing on chance associations within a sample.
  • When the regression weights are applied to another sample, the R2 is smaller; this drop is the shrinkage.
  • The degree of overestimation depends on the ratio of the number of independent variables to the sample size: the larger the ratio, the greater the overestimation.
  • Rules of thumb:

    Estimating Shrinkage
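
    One widely used formula-based estimate of the shrunken R2 is the adjusted R2, 1 - (1 - R2)(N - 1)/(N - k - 1), where N is the sample size and k is the number of independent variables. The sketch below assumes this is the estimator of interest and plugs in the values from the Orange County regression in this handout (R2 = .8855, N = 67, k = 6). The scalar names are illustrative.

```stata
* Adjusted (shrunken) R-squared: 1 - (1 - R2)*(N - 1)/(N - k - 1)
* Values are taken from the Orange County regression output in this handout.
* Sample R-squared
scalar s_r2 = .8855
* Number of observations used in the regression
scalar s_n = 67
* Number of independent variables
scalar s_k = 6
scalar r2_adj = 1 - (1 - s_r2)*(s_n - 1)/(s_n - s_k - 1)
display "Adjusted R-squared = " r2_adj
```

    The result is about .874, in line with the Adj R-squared that regress reports for the same model.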

    Cross Validation

  • Involves two samples: (1) the screening sample and (2) the calibration sample.
  • Compute the regression coefficients in the screening sample.
  • Use the weights from the screening sample to compute predicted scores in the calibration sample.
  • Compute r2YY', the squared correlation between observed scores (Y) and predicted scores (Y'), in the calibration sample.
  • If shrinkage is small and the coefficients change little, combine the samples and recompute the regression.

    Double Cross Validation

  • Do cross validation using "Sample A" as screening sample and "Sample B" as calibration sample.
  • Repeat cross validation using "Sample B" as screening sample and "Sample A" as calibration sample.

    Stata Cross Validation Example

    We will begin by using Orange County 1999 API data for 71 high schools, 67 of which have complete data on the variables in the model.

    use http://www.gseis.ucla.edu/courses/data/ochi
    
    regress api99 pctmeal pctel yrrnd core avged pctemer
    
      Source |       SS       df       MS                  Number of obs =      67
    ---------+------------------------------               F(  6,    60) =   77.36
       Model |  846646.429     6  141107.738               Prob > F      =  0.0000
    Residual |  109446.288    60   1824.1048               R-squared     =  0.8855
    ---------+------------------------------               Adj R-squared =  0.8741
       Total |  956092.716    66  14486.2533               Root MSE      =   42.71
    
    ------------------------------------------------------------------------------
       api99 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     pctmeal |   .6088963   .3348236      1.819   0.074      -.0608506    1.278643
       pctel |  -3.372816   .7189063     -4.692   0.000      -4.810842   -1.934789
       yrrnd |  -30.10476   24.75397     -1.216   0.229      -79.62008    19.41056
        core |   -4.97981   3.339713     -1.491   0.141      -11.66023     1.70061
       avged |   69.37089   15.76158      4.401   0.000       37.84305    100.8987
     pctemer |  -.1026734   .8709507     -0.118   0.907      -1.844834    1.639487
       _cons |   693.2872   105.3493      6.581   0.000       482.5573    904.0171
    ------------------------------------------------------------------------------
    

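    Now we will load the 1999 API data for 236 Los Angeles County high schools. This dataset, lahi, has the same variables as the first dataset, ochi. Because the Orange County estimates are still in memory, predict applies those coefficients to the Los Angeles County schools.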

    use http://www.gseis.ucla.edu/courses/data/lahi
    
    predict p1
    (option xb assumed; fitted values)
    (10 missing values generated)
    
    corr api99 p1
    (obs=226)
    
             |    api99       p1
    ---------+------------------
       api99 |   1.0000
          p1 |   0.8522   1.0000
    
    
    display "R-squared = ", .8522^2
    
    R-squared =  .72624484

    Note that the cross-validated R2 of .7262 in the Los Angeles County sample is much lower than the R2 of .8855 from the original Orange County regression.
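
    The size of the drop can be computed directly. This is a minimal sketch that takes shrinkage to be the screening-sample R2 minus the squared cross-validity correlation, using the two values reported above. The scalar name is illustrative.

```stata
* Shrinkage = screening-sample R-squared minus squared correlation
* between observed and predicted scores in the calibration sample.
scalar shrink = .8855 - .8522^2
display "Shrinkage = " shrink
```

    The difference is roughly .16, noticeably more than the gap between R-squared and Adj R-squared in the Orange County output (.8855 versus .8741).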

    Stata Cross Validation Method 2

    use http://www.gseis.ucla.edu/courses/data/ochi
    
    count
       71
    
    append using http://www.gseis.ucla.edu/courses/data/lahi
    (label yn already defined)
    
    count
      307
    
    generate sample=1 in 1/71
    (236 missing values generated)
    
    replace sample=2 in 72/l
    (236 real changes made)
    
    regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1
    
          Source |       SS       df       MS              Number of obs =      67
    -------------+------------------------------           F(  6,    60) =   77.36
           Model |  846646.429     6  141107.738           Prob > F      =  0.0000
        Residual |  109446.288    60   1824.1048           R-squared     =  0.8855
    -------------+------------------------------           Adj R-squared =  0.8741
           Total |  956092.716    66  14486.2533           Root MSE      =   42.71
    
    ------------------------------------------------------------------------------
           api99 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         pctmeal |   .6088963   .3348236     1.82   0.074    -.0608507    1.278643
           pctel |  -3.372816   .7189063    -4.69   0.000    -4.810842   -1.934789
           yrrnd |  -30.10476   24.75397    -1.22   0.229    -79.62008    19.41056
            core |   -4.97981   3.339713    -1.49   0.141    -11.66023    1.700611
           avged |   69.37089   15.76158     4.40   0.000     37.84305    100.8987
         pctemer |  -.1026734   .8709507    -0.12   0.907    -1.844834    1.639487
           _cons |   693.2872   105.3493     6.58   0.000     482.5573    904.0171
    ------------------------------------------------------------------------------
    
    predict pre
    (option xb assumed; fitted values)
    (14 missing values generated)
    
    corr api99 pre if sample==2
    (obs=226)
    
                 |    api99      pre
    -------------+------------------
           api99 |   1.0000
             pre |   0.8522   1.0000
    
    
    display r(rho)^2
    .72620797
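
    The combined file from Method 2 also lends itself to the second half of a double cross validation, with Los Angeles County as the screening sample and Orange County as the calibration sample. The sketch below assumes the combined file is still in memory with the variable sample defined as above. The variable name pre2 is an illustrative choice, and no results are shown because this run is not part of the original session.

```stata
* Reverse the roles: screen on sample 2 (Los Angeles County)
regress api99 pctmeal pctel yrrnd core avged pctemer if sample==2

* Predicted scores for every school, using the Los Angeles County weights
predict pre2

* Squared cross-validity correlation in sample 1 (Orange County)
corr api99 pre2 if sample==1
display r(rho)^2

* If shrinkage is small in both directions and the coefficients change
* little, recompute the regression on the combined samples
regress api99 pctmeal pctel yrrnd core avged pctemer
```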


    UCLA Department of Education

    Phil Ender, 29Jan98