\(\bullet~~\) In the first part of the tutorial, we cover the basics of principal component analysis (PCA), including the procedure for deciding how many components should be retained in a model and how to plot the relevant graphs. First, we load the appropriate data and packages into the environment. The primary R function used in this tutorial, prcomp, belongs to the stats package, so it is loaded automatically. We first take a glance at the number of variables and the correlation matrix of the data:
library(faraway)
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
library(lars)
## Loaded lars 1.2
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
data("USArrests")
dim(USArrests)
## [1] 50 4
str(USArrests)
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
options(digits=3)
p<-ncol(USArrests); p # No. of variables
## [1] 4
R<-cor(USArrests); R # Display the correlation matrix
## Murder Assault UrbanPop Rape
## Murder 1.0000 0.802 0.0696 0.564
## Assault 0.8019 1.000 0.2589 0.665
## UrbanPop 0.0696 0.259 1.0000 0.411
## Rape 0.5636 0.665 0.4113 1.000
\(\bullet~~\) Now, the first step of a Principal Component Analysis is to center and scale the data using the prcomp command. This command is both convenient and useful: it subtracts each variable's mean from every observation and then divides the result by that variable's standard deviation. We can then use the print and summary commands to see the rotation (loadings), the standard deviations, and the proportion of variance each component contributes to the dataset, and use fviz_eig from the factoextra package to draw a nicely formatted scree plot.
USArrests <- as.data.frame(USArrests)
arrests.pca <- prcomp(USArrests,center = TRUE,scale. = TRUE)
names(arrests.pca)
## [1] "sdev" "rotation" "center" "scale" "x"
print(arrests.pca)
## Standard deviations (1, .., p=4):
## [1] 1.575 0.995 0.597 0.416
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Murder -0.536 0.418 -0.341 0.649
## Assault -0.583 0.188 -0.268 -0.743
## UrbanPop -0.278 -0.873 -0.378 0.134
## Rape -0.543 -0.167 0.818 0.089
summary(arrests.pca)
## Importance of components%s:
## PC1 PC2 PC3 PC4
## Standard deviation 1.57 0.995 0.5971 0.4164
## Proportion of Variance 0.62 0.247 0.0891 0.0434
## Cumulative Proportion 0.62 0.868 0.9566 1.0000
fviz_eig(arrests.pca) # Visualization of variance explained by each component
\(\bullet~~\) In the summary output, the line Proportion of Variance shows what percentage of the variance is explained by each component. Here the first component explains \(62\%\) of the variability, the second component explains \(24.7\%\) of the variance, and so on. The line Cumulative Proportion shows the percentage of variance explained by the first k components. Specifically, the PC3 column gives the value 0.9566, which means that if we include the first 3 components, then \(95.66\%\) of the variance is explained.
\(\bullet~~\) For edification only, suppose the model has n predictors. The purpose of PCA is to choose the minimum number of components k that explain a desired fraction of the total variation: \[\sum_{i=1}^k \lambda_i \ge (1-\alpha)\sum_{i=1}^n \lambda_i,\] where \(\lambda_i\) is the variance (eigenvalue) of the i-th component and \(\alpha\) could take 0.05, 0.01, etc.
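\(\bullet~~\) As a quick illustration of this rule (a minimal sketch using the arrests.pca object fitted above; the 5% threshold is just an example value), we can compute the eigenvalues from the component standard deviations and find the smallest k meeting the criterion:
# Eigenvalues are the squared component standard deviations
lambda <- arrests.pca$sdev^2
cum.prop <- cumsum(lambda)/sum(lambda); cum.prop # Cumulative proportion of variance
alpha <- 0.05
which(cum.prop >= 1 - alpha)[1] # Smallest k explaining at least (1 - alpha) of the variance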
\(\bullet~~\) We can install the factoextra package to improve the visualization of the scree plot (as shown above), the variable plot, the individual plot and the biplot, which we demonstrate below.
# Variable plot
fviz_pca_var(arrests.pca,
col.var = "contrib", # Control variable color using their contributions to the PC
gradient.cols = c("#70f6ff", "#00AFBB", "#ffd224",
"#d8ac00", "#FC4E07", "#a73203"),
repel = TRUE, # Avoid text overlapping
ggtheme = theme_minimal()
)
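# Individual plot (a minimal sketch; coloring individuals by cos2, their quality of
# representation on the first two components, is one illustrative choice of arguments)
fviz_pca_ind(arrests.pca,
col.ind = "cos2", # Control individual color using their quality of representation
gradient.cols = c("#70f6ff", "#00AFBB", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
ggtheme = theme_minimal()
)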
# Biplot
fviz_pca_biplot(arrests.pca, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
\(\bullet~~\) Interpretation of the variable plot: the variable plot shows the coordinates of each variable on the first two principal components, with the color gradient indicating the intensity of each variable's contribution to those components. As the graph suggests, the variables Murder, Assault and Rape lie closer to each other than to the variable UrbanPop, indicating that the former three variables are more similar to one another.
\(\bullet~~\) Interpretation of the biplot: the biplot shows the coordinates of both the observations and the variables on the first two components, based on the scores and the rotation (loadings). Again, a shorter distance between two observations indicates a higher similarity between them.
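\(\bullet~~\) To see the coordinates behind the individual points in the biplot (a minimal check using the arrests.pca object from above), we can inspect the score matrix x directly:
head(arrests.pca$x[,1:2]) # Scores of the first few states on PC1 and PC2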
\(\bullet~~\) The coordinates of the individuals are simply the principal component scores, while the x and y coordinates of the variables (corresponding to the first and second principal components) can be computed by extracting the rotation matrix and the standard deviation of each component and multiplying them accordingly. For example, the x coordinate of each variable arrow is the product of its loading on PC1 and the standard deviation of PC1. We accomplish this with the following commands:
# Compute the coordinate of variables
func_varcoord <- function(rotation, comp.std){
rotation * comp.std} # First create a function computing the multiplication of rotation and component standard deviation
rotation <- arrests.pca$rotation
std <- arrests.pca$sdev
varcoord <- t(apply(rotation, 1, func_varcoord, std))
varcoord[,1:2] # Compute only PC1 and PC2
## PC1 PC2
## Murder -0.844 0.416
## Assault -0.918 0.187
## UrbanPop -0.438 -0.868
## Rape -0.856 -0.166
\(\bullet~~\) The above table gives the x coordinates (PC1 column) and y coordinates (PC2 column) corresponding to Murder, Assault, UrbanPop and Rape, respectively. You can double check that these are indeed the coordinates shown in the variable plot, for example with the get_pca_var function from factoextra, as sketched below.
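\(\bullet~~\) A minimal verification sketch (assuming the arrests.pca object fitted above); the coordinates returned by get_pca_var should match the table computed by hand:
pca.var <- get_pca_var(arrests.pca) # Extract variable results from the PCA
pca.var$coord[,1:2] # Should match the PC1 and PC2 columns of varcoord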
\(\bullet~~\) After the first part of the tutorial, we have gained a basic understanding of how PCA works and which arguments are frequently used. In this second part, we apply the concept of PCA to regression, that is, principal component regression (PCR). This time we use the dataset seatpos in the package faraway. The data set has 38 observations on 9 variables. We fit the PCR model using hipcenter as the response variable and Age, Weight, HtShoes, Ht, Seated, Arm, Thigh and Leg as predictors. Our target is to determine the number of components needed in our model.
\(\bullet~~\) Unlike the first part, we now evaluate the model using K-fold cross-validation, which requires separating the dataset into two parts: training data and test data. Since the data set has 38 observations, we take K = 10 in this example, so the test data comprise \(38\times \frac{1}{10} \approx 4\) observations and the training data comprise the remaining \(38-4=34\) observations; the 10-fold cross-validation is then carried out within the training data.
Info! For more background on K-fold cross-validation, please consult: https://www.coursera.org/learn/python-machine-learning/lecture/Vm0Ie/cross-validation. (If you are also a fan of Python, this is a perfect fit.)
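\(\bullet~~\) Conceptually, K-fold cross-validation assigns each training observation to one of K folds, fits the model K times leaving one fold out each time, and averages the prediction error. The pcr call below does this internally via validation='CV' and segments=10, but a minimal sketch of building such fold assignments by hand (the seed and object names here are for illustration only) looks like this:
# Randomly assign each of the 34 training observations to one of 10 folds
set.seed(1)
K <- 10
n.train <- 34
folds <- sample(rep(1:K, length.out = n.train))
table(folds) # Fold sizes are as equal as possible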
library(faraway)
data(seatpos)
dim(seatpos)
## [1] 38 9
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187 185 95.2 36.1 45.3 41.3 -206.3
## 2 31 175 168 166 83.8 32.9 36.5 35.9 -178.2
## 3 23 100 154 152 82.9 26.0 36.6 31.0 -71.7
## 4 19 185 190 187 97.3 37.4 44.1 41.0 -257.7
## 5 23 159 178 174 93.9 29.5 40.1 36.9 -173.2
## 6 47 170 179 177 92.4 36.0 43.2 37.4 -185.2
# Initialize
predictor_seatpos <- seatpos[,-9] # Take out the response column
R<-cor(predictor_seatpos); R # Display the correlation matrix
## Age Weight HtShoes Ht Seated Arm Thigh Leg
## Age 1.0000 0.0807 -0.0793 -0.0901 -0.170 0.360 0.0913 -0.0423
## Weight 0.0807 1.0000 0.8282 0.8285 0.776 0.698 0.5726 0.7843
## HtShoes -0.0793 0.8282 1.0000 0.9981 0.930 0.752 0.7249 0.9084
## Ht -0.0901 0.8285 0.9981 1.0000 0.928 0.752 0.7350 0.9098
## Seated -0.1702 0.7756 0.9297 0.9282 1.000 0.625 0.6071 0.8119
## Arm 0.3595 0.6976 0.7520 0.7521 0.625 1.000 0.6711 0.7538
## Thigh 0.0913 0.5726 0.7249 0.7350 0.607 0.671 1.0000 0.6495
## Leg -0.0423 0.7843 0.9084 0.9098 0.812 0.754 0.6495 1.0000
seatpos.pca <- prcomp(predictor_seatpos,center = TRUE,scale. = TRUE)
eig.val <- get_eigenvalue(seatpos.pca); eig.val # Eigenvalue in PCA is the variance
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.67319 70.9148 70.9
## Dim.2 1.23679 15.4598 86.4
## Dim.3 0.46374 5.7968 92.2
## Dim.4 0.24096 3.0120 95.2
## Dim.5 0.19422 2.4277 97.6
## Dim.6 0.13917 1.7397 99.4
## Dim.7 0.05034 0.6293 100.0
## Dim.8 0.00159 0.0199 100.0
summary(seatpos.pca) # Summary gives standard deviation
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.382 1.112 0.681 0.4909 0.4407 0.3731 0.22438
## Proportion of Variance 0.709 0.155 0.058 0.0301 0.0243 0.0174 0.00629
## Cumulative Proportion 0.709 0.864 0.922 0.9518 0.9761 0.9935 0.99980
## PC8
## Standard deviation 0.0399
## Proportion of Variance 0.0002
## Cumulative Proportion 1.0000
fviz_eig(seatpos.pca)
\(\bullet~~\) The above gives a visualization of the percentage of variance explained by each component. Note that the get_eigenvalue function reports the variances (eigenvalues) while the summary command reports the standard deviations; the two are equivalent for the analysis, since each eigenvalue is the square of the corresponding standard deviation.
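\(\bullet~~\) A quick check of this equivalence (using the seatpos.pca object fitted above):
seatpos.pca$sdev^2 # Squared standard deviations reproduce the eigenvalues from get_eigenvalue
cumsum(seatpos.pca$sdev^2)/sum(seatpos.pca$sdev^2) # Matches the cumulative proportion in the summary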
\(\bullet~~\) At first sight, if we want to capture 95% of the variance of the data, we should include the first 4 components. Now let's see whether PCR produces a similar result.
set.seed(12) # reproducibility
test.index = sample(seq_len(nrow(seatpos)), size = 4) # Set aside approximately 10% of the observations as test data, leaving 90% for training
test.data = seatpos[test.index,]
train.data = seatpos[-test.index,] # The remaining obs are train data
# Perform principal component regression to determine the number of components needed
pcr.fit <- pcr(hipcenter~Age + Weight + HtShoes + Ht + Seated + Arm +Thigh + Leg, data = train.data, ncomp =8, validation='CV', segments=10) # Total 8 components
# rmse: root mean squared error between predicted and observed values
rmse = function(x,y) {sqrt(mean((x-y)^2))}
# Compute the cross-validated RMSE of prediction (aggregated across the folds), the key criterion for the PCR model
rmseCV <- RMSEP(pcr.fit, estimate ='CV'); rmseCV
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## 56.99 42.25 42.31 37.33 38.43
## 5 comps 6 comps 7 comps 8 comps
## 39.37 39.86 38.12 38.65
plot(rmseCV, xaxt="n", main='Predicted RMSE by number of components')
axis(1, at=c(0,1,2,3,4,5,6,7,8), labels = c(0,1,2,3,4,5,6,7,8)) # Customize x-axis
numcomp.pcr <- which.min(rmseCV$val); (as.numeric(numcomp.pcr)-1) # Number of components to minimize RMSE
## [1] 3
yfit <- predict(pcr.fit, test.data, ncomp=(as.numeric(numcomp.pcr)-1)); yfit
## , , 3 comps
##
## hipcenter
## 3 -100.8
## 31 -99.7
## 34 -98.8
## 10 -122.0
rmse.PCR <- rmse(test.data$hipcenter, yfit); rmse.PCR # RMSE on the held-out test data
## [1] 39.1
\(\bullet~~\) We first use the set.seed command to ensure reproducibility. Then we divide the dataset into two parts: test.data and train.data. Finally, we fit the PCR model with validation='CV' and segments=10 to request 10-fold cross-validation.
\(\bullet~~\) The code chunk gives two main results: we should include 3 components to obtain the lowest cross-validated RMSE (the best predictive model), and the RMSE on the held-out test data is 39.1. Therefore, in this case, PCR produces a slightly different result from the 95%-variance rule, but both PCA and PCR are efficient ways to determine the desired model.
\(\bullet~~\) In particular, based on PCR, our final predictive model is \(hipcenter= \beta_0+\beta_1 pc_1+\beta_2 pc_2+\beta_3 pc_3\). (Note: this yields a different model than the 4-component choice suggested by PCA alone, but such differences are the very nature of cross-validation.)
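\(\bullet~~\) To inspect the fitted coefficients (a minimal sketch using the pcr.fit and train.data objects from above; the names S, scores3 and pc1-pc3 are introduced here only for illustration), the component-level coefficients \(\beta_1,\beta_2,\beta_3\) can be recovered by regressing the response on the first three score columns, and coef from the pls package expresses the same fit on the original predictor scale:
# Component-level coefficients: regress hipcenter on the first three PC scores
S <- unclass(pcr.fit$scores) # Score matrix for the training data
scores3 <- data.frame(pc1 = S[,1], pc2 = S[,2], pc3 = S[,3], hipcenter = train.data$hipcenter)
lm(hipcenter ~ pc1 + pc2 + pc3, data = scores3)
# Equivalent fit expressed as coefficients on the original predictors
coef(pcr.fit, ncomp = 3, intercept = TRUE)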
Principal Component Analysis in R: prcomp vs princomp, by kassambara (2017): http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
University of Michigan, Applied Machine Learning in Python (Coursera), Cross-validation: https://www.coursera.org/learn/python-machine-learning/lecture/Vm0Ie/cross-validation
Interpretation of biplots in principal components analysis (Cross Validated): https://stats.stackexchange.com/questions/2038/interpretation-of-biplots-in-principal-components-analysis