Ed231A

Multivariate Analysis

Principal Components Analysis


In principal components analysis we attempt to explain the total variability of p correlated variables through the use of p orthogonal principal components. The components themselves are merely weighted linear combinations of the original variables.

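The first principal component can be expressed as follows,

$$Y_1 = a_{11}X_1 + a_{21}X_2 + \cdots + a_{p1}X_p$$

where X1, ..., Xp are the original variables. The aj1 are scaled such that a1'a1 = 1. Among all linear combinations of the p variables with weights scaled in this way, Y1 has the maximum variance. The variance of Y1 is λ1.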

Next, principal component Y2 is formed such that its variance, λ2, is the maximum amount of the remaining variance, subject to Y2 being orthogonal to the first principal component. That is, a1'a2 = 0.

One continues to extract components until some stopping criterion is met or until p components are formed. It is possible to compute principal components from either the covariance matrix or the correlation matrix of the p variables. If the variables are scaled in a similar manner, then many researchers prefer to use the covariance matrix. When the variables are scaled very differently from one another, then using the correlation matrix is preferred. A common stopping criterion when using the correlation matrix is to stop when the variance of a component is less than one.
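
To see why the scaling of the variables matters for this choice, here is a minimal Python sketch. The two variables and the helper names are made up for illustration and are not part of the course data. It shows that re-expressing one variable in different units changes its share of the total variance in the covariance matrix, while the correlation is unaffected.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def corr(xs, ys):
    """Pearson correlation, i.e. the covariance of the standardized variables."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

# Hypothetical data: x1 measured in metres, x2 in arbitrary units.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
x1_cm = [100.0 * v for v in x1]  # the same variable expressed in centimetres

# Share of the total variance (trace of the covariance matrix) due to x1.
share_m = cov(x1, x1) / (cov(x1, x1) + cov(x2, x2))
share_cm = cov(x1_cm, x1_cm) / (cov(x1_cm, x1_cm) + cov(x2, x2))

print("share of total variance:", round(share_m, 4), round(share_cm, 4))
print("correlation:", round(corr(x1, x2), 4), round(corr(x1_cm, x2), 4))
```

With the covariance matrix, the rescaled variable would dominate the first component simply because of its units. The correlation matrix puts every variable on the same footing.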

The weights used to create the principal components are the eigenvectors of the characteristic equation,

$$(S - \lambda_i I)a_i = 0 \quad \text{or} \quad (R - \lambda_i I)a_i = 0$$

where S is the covariance matrix and R is the correlation matrix. The λi are the eigenvalues, the variances of the components.

The eigenvalues are obtained by solving |S - λiI| = 0 for λi.

Consider the following correlation matrix:

matrix r = (1, .7 \ .7, 1)
matrix list r
  
symmetric r[2,2]
    c1  c2
r1   1
r2  .7   1
  
matrix symeigen a l = r
  
/* eigenvalues */
matrix list l

l[1,2]
     e1   e2
r1  1.7   .3
  
/* eigenvectors */
matrix list a

symmetric a[2,2]
            e1          e2
r1   .70710678   .70710678
r2   .70710678  -.70710678
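
The equation for the eigenvalues can be expressed as solving for the determinant of

$$\begin{vmatrix} 1-\lambda & .7 \\ .7 & 1-\lambda \end{vmatrix}$$

and setting it to zero, which reduces to

$$(1-\lambda)^2 - .49 = \lambda^2 - 2\lambda + .51 = 0$$

Using the quadratic formula

$$\lambda = \frac{2 \pm \sqrt{(-2)^2 - 4(1)(.51)}}{2} = \frac{2 \pm 1.4}{2}$$

yields the two roots of 1.7 and .3. These roots are the eigenvalues, also known as the characteristic values or characteristic roots. Once you have obtained the eigenvalues you use them to obtain a solution for the eigenvectors.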
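
The two steps for the 2 x 2 correlation matrix (roots of the determinant, then eigenvectors) can be sketched in a few lines of Python. This is an illustration only. The function names are hypothetical, and the eigenvector step assumes the off-diagonal correlation r is not zero.

```python
import math

def eigenvalues_2x2_corr(r):
    """Roots of |R - lam*I| = 0 for R = [[1, r], [r, 1]], largest first."""
    # The determinant expands to lam^2 - 2*lam + (1 - r^2) = 0.
    a, b, c = 1.0, -2.0, 1.0 - r * r
    root = math.sqrt(b * b - 4.0 * a * c)
    return (-b + root) / (2.0 * a), (-b - root) / (2.0 * a)

def eigenvector_2x2_corr(r, lam):
    """Unit-length solution of (R - lam*I)a = 0, assuming r != 0."""
    # The first row gives (1 - lam)*a1 + r*a2 = 0, so a is proportional
    # to (r, lam - 1); dividing by the length makes a'a = 1.
    length = math.hypot(r, lam - 1.0)
    return (r / length, (lam - 1.0) / length)

r = 0.7
lam1, lam2 = eigenvalues_2x2_corr(r)
a1 = eigenvector_2x2_corr(r, lam1)
a2 = eigenvector_2x2_corr(r, lam2)

print("eigenvalues:", round(lam1, 4), round(lam2, 4))
print("a1:", [round(v, 4) for v in a1])
print("a2:", [round(v, 4) for v in a2])
print("a1'a2:", round(a1[0] * a2[0] + a1[1] * a2[1], 10))
```

The results should agree with the eigenvalues and eigenvectors listed by matrix symeigen above, up to rounding.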

It is possible to interpret the eigenvectors directly but most researchers also look at the correlations between the components and the variables. These correlations are known as the component loadings.
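
When the components come from a correlation matrix, the loading of variable j on component k is the eigenvector weight multiplied by the square root of the eigenvalue. The sketch below applies that standard relationship to the 2 x 2 example, using the eigenvalues and eigenvectors listed earlier. The variable names are illustrative.

```python
import math

# Eigenvalues and eigenvectors from the matrix symeigen listing.
eigenvalues = [1.7, 0.3]
eigenvectors = [
    [0.70710678, 0.70710678],   # component 1: weights for variables 1 and 2
    [0.70710678, -0.70710678],  # component 2: weights for variables 1 and 2
]

# loadings[j][k] = weight of variable j on component k * sqrt(eigenvalue k)
loadings = [
    [eigenvectors[k][j] * math.sqrt(eigenvalues[k]) for k in range(2)]
    for j in range(2)
]

for j, row in enumerate(loadings, start=1):
    # With all components retained, the squared loadings of a
    # standardized variable sum to its variance, which is 1.
    total = sum(v * v for v in row)
    print(f"variable {j}:", [round(v, 4) for v in row], "sum of squares:", round(total, 4))
```

Both variables correlate about .92 with the first component, which is why a single component summarizes this matrix well.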

Principal Components Analysis Example

The following example uses data for five socio-economic variables for 12 different locations. The variables are total population, median schooling, total employed, misc. professional services, and median housing value. The data are from Harman (1976).

use http://www.gseis.ucla.edu/courses/data/harman1

pca pop medsch employ profser medhouse, means
(obs=12)

 Variable |         Mean    Std. Dev.          Min          Max
----------+----------------------------------------------------
      pop |     6241.667     3439.994         1000         9900
   medsch |     11.44167     1.786545          8.3         13.7
   employ |     2333.333     1241.212          400         4000
  profser |     120.8333     114.9275           10          390
 medhouse |        17000     6367.531         9000        25000


            (principal components; 5 components retained)
Component    Eigenvalue     Difference    Proportion    Cumulative
------------------------------------------------------------------
     1        2.87331         1.07665      0.5747         0.5747
     2        1.79666         1.58182      0.3593         0.9340
     3        0.21484         0.11490      0.0430         0.9770
     4        0.09993         0.08468      0.0200         0.9969
     5        0.01526               .      0.0031         1.0000

            Eigenvectors
 Variable |      1          2          3          4          5    
----------+------------------------------------------------------
      pop |   0.34273    0.60163    0.05952    0.20403    0.68950  
   medsch |   0.45251   -0.40641    0.68882   -0.35357    0.17486  
   employ |   0.39669    0.54167    0.24796    0.02294   -0.69801  
  profser |   0.55006   -0.07782   -0.66408   -0.50039   -0.00012  
 medhouse |   0.46674   -0.41643   -0.13965    0.76318   -0.08243 
  
greigen
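
The Proportion and Cumulative columns, and the eigenvalue-of-at-least-one stopping rule, can be reproduced by hand. Here is a minimal Python sketch that uses only the five eigenvalues printed above. The variable names are illustrative.

```python
# Eigenvalues copied from the pca output for the Harman data.
eigenvalues = [2.87331, 1.79666, 0.21484, 0.09993, 0.01526]

# For a correlation matrix the eigenvalues sum to the number of variables.
total = sum(eigenvalues)

proportions = [ev / total for ev in eigenvalues]
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Stopping rule from the text: keep components whose variance is at least one.
retained = [ev for ev in eigenvalues if ev >= 1.0]

for k, (ev, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"{k}  {ev:8.5f}  {p:6.4f}  {c:6.4f}")
print("components retained:", len(retained))
```

Under this rule two components would be kept, and together they account for about 93% of the total variance.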
 

Next we will use data from the High School and Beyond survey.

use http://www.gseis.ucla.edu/courses/data/hsb2

pca read write math science
(obs=200)

            (principal components; 4 components retained)
Component    Eigenvalue     Difference    Proportion    Cumulative
------------------------------------------------------------------
     1        2.85491         2.41937      0.7137         0.7137
     2        0.43554         0.06172      0.1089         0.8226
     3        0.37382         0.03810      0.0935         0.9161
     4        0.33573               .      0.0839         1.0000

               Eigenvectors
    Variable |      1          2          3          4
-------------+-------------------------------------------
        read |   0.50714   -0.22100   -0.54927    0.62632
       write |   0.48572    0.82012    0.27120    0.13393
        math |   0.51124   -0.05078   -0.38296   -0.76772
     science |   0.49550   -0.52535    0.69145    0.01980
 
pca read write math science, cov
(obs=200)

            (principal components; 4 components retained)
Component    Eigenvalue     Difference    Proportion    Cumulative
------------------------------------------------------------------
     1      272.11643       231.66023      0.7147         0.7147
     2       40.45620         3.10970      0.1063         0.8209
     3       37.34649         6.50385      0.0981         0.9190
     4       30.84264               .      0.0810         1.0000

               Eigenvectors
    Variable |      1          2          3          4
-------------+-------------------------------------------
        read |   0.54030   -0.19508   -0.73149   -0.36734
       write |   0.46626    0.79436    0.27230   -0.27829
        math |   0.48540    0.05035   -0.09084    0.86810
     science |   0.50504   -0.57306    0.61848   -0.18444
 
score cs1 cs2
            (based on unrotated principal components)
            (2 scorings not used)

               Scoring Coefficients
    Variable |      1          2
-------------+---------------------
        read |   0.54030   -0.19508
       write |   0.46626    0.79436
        math |   0.48540    0.05035
     science |   0.50504   -0.57306
 
corr cs1 cs2
(obs=200)

             |      cs1      cs2
-------------+------------------
         cs1 |   1.0000
         cs2 |   0.0000   1.0000
 
regress socst read write math science

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   44.49
       Model |  10944.2858     4  2736.07144           Prob > F      =  0.0000
    Residual |  11991.9092   195  61.4969704           R-squared     =  0.4772
-------------+------------------------------           Adj R-squared =  0.4664
       Total |   22936.195   199  115.257261           Root MSE      =   7.842

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |    .380752   .0800116     4.76   0.000     .2229529    .5385511
       write |   .3751806   .0803521     4.67   0.000     .2167099    .5336512
        math |   .1322237   .0889155     1.49   0.139    -.0431359    .3075833
     science |  -.0279416   .0793993    -0.35   0.725    -.1845333      .12865
       _cons |   7.206027   3.611316     2.00   0.047     .0837748    14.32828
------------------------------------------------------------------------------
 
/* regression using  principal component scores */
regress socst cs1 cs2

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   83.68
       Model |  10535.0947     2  5267.54734           Prob > F      =  0.0000
    Residual |  12401.1003   197  62.9497478           R-squared     =  0.4593
-------------+------------------------------           Adj R-squared =  0.4538
       Total |   22936.195   199  115.257261           Root MSE      =  7.9341

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cs1 |   .4307223   .0340952    12.63   0.000     .3634839    .4979607
         cs2 |   .2464224   .0884256     2.79   0.006     .0720402    .4208047
       _cons |   6.214682   3.632656     1.71   0.089    -.9492029    13.37857
------------------------------------------------------------------------------
 
/* anova using principal component scores */
anova cs1 prog

                           Number of obs =     200     R-squared     =  0.2154
                           Root MSE      = 14.6861     Adj R-squared =  0.2074

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  11661.6631     2  5830.83154      27.03     0.0000
                         |
                    prog |  11661.6631     2  5830.83154      27.03     0.0000
                         |
                Residual |  42489.5075   197  215.682779   
              -----------+----------------------------------------------------
                   Total |  54151.1706   199  272.116435 
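
The Total line of the anova table offers a check on the earlier statement that the variance of a component equals its eigenvalue. The total sum of squares of cs1 divided by its degrees of freedom should match the first eigenvalue from the covariance-matrix analysis. A small Python sketch with the numbers copied from the output above:

```python
# Total line of the anova table for cs1.
total_ss = 54151.1706
total_df = 199

# First eigenvalue reported by: pca read write math science, cov
lambda_1 = 272.11643

variance_cs1 = total_ss / total_df
print("variance of cs1:", round(variance_cs1, 4))
print("first eigenvalue:", lambda_1)
print("difference:", round(abs(variance_cs1 - lambda_1), 4))
```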

Principal Components Factor Analysis

In principal components factor analysis we apply the same methods used in traditional common factor analysis, but we analyze the total variance, not the common variance. This process yields a loading matrix, in this case a component loading matrix, that can be rotated for ease of interpretation.

Summary of principal components factor analysis

  1. Method of Initial Factor Solution - Principal Components Analysis.
  2. Estimation of Communalities - Ones on the diagonal of the correlation matrix.
  3. Number of Factors to Retain - Number of eigenvalues greater than or equal to one.
  4. Method of Rotation - Varimax or Promax.

How to Do It

  • Let D1 be the diagonal matrix of eigenvalues of R.
  • And let D be the diagonal matrix of retained eigenvalues.
  • Let V be the matrix of retained normalized eigenvectors.
  • Let A be the matrix of component loadings: $A = VD^{1/2}$
  • Let W be the matrix of component score coefficients: $W = VD^{-1/2}$
  • Let F be the matrix of component scores: $F = ZW$, where Z is the matrix of standardized variables.

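
The matrix steps can be traced numerically. The Python sketch below assumes the usual definitions A = V D^(1/2) and W = V D^(-1/2). It takes the two retained eigenvalues and eigenvectors from the pca output for the Harman data shown earlier and builds the loadings, score coefficients, and communalities. The variable names are illustrative, and the results should agree with Stata's factor output only up to the rounding of the printed eigenvectors.

```python
import math

names = ["pop", "medsch", "employ", "profser", "medhouse"]

# Diagonal of D: the two retained eigenvalues (those >= 1).
d = [2.87331, 1.79666]

# V: retained normalized eigenvectors, one row per variable.
v = [
    [0.34273, 0.60163],
    [0.45251, -0.40641],
    [0.39669, 0.54167],
    [0.55006, -0.07782],
    [0.46674, -0.41643],
]

# A = V D^(1/2): scale each eigenvector column by sqrt(eigenvalue).
a = [[x * math.sqrt(d[k]) for k, x in enumerate(row)] for row in v]

# W = V D^(-1/2): coefficients applied to the standardized variables.
w = [[x / math.sqrt(d[k]) for k, x in enumerate(row)] for row in v]

# Communality = row sum of squared loadings; uniqueness = 1 - communality.
communalities = [sum(x * x for x in row) for row in a]

for name, row, h2 in zip(names, a, communalities):
    print(f"{name:>8} {row[0]:9.5f} {row[1]:9.5f} {h2:8.5f} {1 - h2:8.5f}")
```

Component scores would then be F = ZW for a matrix Z of standardized observations. That step is left out here because it needs the raw data.
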
    use http://www.gseis.ucla.edu/courses/data/harman1
    
    /* principal components factor analysis */
     
    factor pop medsch employ profser medhouse, pcf
    (obs=12)
    
                (principal component factors; 2 factors retained)
      Factor     Eigenvalue     Difference    Proportion    Cumulative
    ------------------------------------------------------------------
         1        2.87331         1.07665      0.5747         0.5747
         2        1.79666         1.58182      0.3593         0.9340
         3        0.21484         0.11490      0.0430         0.9770
         4        0.09993         0.08468      0.0200         0.9969
         5        0.01526               .      0.0031         1.0000
     
                Factor Loadings
     Variable |      1          2    Uniqueness
    ----------+--------------------------------
          pop |   0.58096    0.80642    0.01217
       medsch |   0.76704   -0.54476    0.11489
       employ |   0.67243    0.72605    0.02069
      profser |   0.93239   -0.10431    0.11976
     medhouse |   0.79116   -0.55818    0.06250
     
    mat psi = e(Psi)'
    mat com = J(rowsof(psi),1,1)
    mat com = com - psi
    mat colnames com=communalities
    mat list com
    
    com[5,1]
              communalities
         pop      .98782629
      medsch      .88510558
      employ      .97930583
     profser      .88023561
    medhouse      .93750043
    
      
    rotate
     
                (varimax rotation)
                   Rotated Factor Loadings
        Variable |      1          2    Uniqueness
    -------------+--------------------------------
             pop |   0.01135    0.99383    0.01217
          medsch |   0.94079   -0.00439    0.11489
          employ |   0.13241    0.98070    0.02069
         profser |   0.82269    0.45101    0.11976
        medhouse |   0.96824   -0.00149    0.06250
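
Varimax is an orthogonal rotation, so it redistributes variance between the factors without changing how much of each variable is explained. That is why the Uniqueness column is identical before and after rotation. The Python sketch below checks this with the two loading tables printed above. The dictionary names are illustrative.

```python
# Loadings copied from the factor ..., pcf output (before rotation).
unrotated = {
    "pop":      (0.58096, 0.80642),
    "medsch":   (0.76704, -0.54476),
    "employ":   (0.67243, 0.72605),
    "profser":  (0.93239, -0.10431),
    "medhouse": (0.79116, -0.55818),
}

# Loadings copied from the rotate output (varimax).
rotated = {
    "pop":      (0.01135, 0.99383),
    "medsch":   (0.94079, -0.00439),
    "employ":   (0.13241, 0.98070),
    "profser":  (0.82269, 0.45101),
    "medhouse": (0.96824, -0.00149),
}

def communality(loadings):
    """Sum of squared loadings across the retained factors."""
    return sum(x * x for x in loadings)

for name in unrotated:
    h2_before = communality(unrotated[name])
    h2_after = communality(rotated[name])
    print(f"{name:>8} {h2_before:8.5f} {h2_after:8.5f} {1 - h2_after:8.5f}")
```

The two communality columns should agree to about four decimal places. The small discrepancies come from the five-digit rounding of the printed loadings.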
    
    use http://www.gseis.ucla.edu/courses/data/hsb2
    
    /* principal components factor analysis */
    
    factor read write math science, pcf
    (obs=200)
    
                (principal component factors; 1 factor retained)
      Factor     Eigenvalue     Difference    Proportion    Cumulative
    ------------------------------------------------------------------
         1        2.85491         2.41937      0.7137         0.7137
         2        0.43554         0.06172      0.1089         0.8226
         3        0.37382         0.03810      0.0935         0.9161
         4        0.33573               .      0.0839         1.0000
    
                   Factor Loadings
        Variable |      1    Uniqueness
    -------------+---------------------
            read |   0.85688    0.26575
           write |   0.82070    0.32646
            math |   0.86382    0.25382
         science |   0.83722    0.29906
    
    mat psi = e(Psi)'
    mat com = J(rowsof(psi),1,1)
    mat com = com - psi
    mat colnames com=communalities
    mat list com
    
    com[4,1]
             communalities
       read      .73425017
      write       .6735421
       math      .74617708
    science      .70093912
     


    UCLA Department of Education

    Phil Ender, 15oct05, 25may02; 29jan98