\(\bullet~~\) In the first part of the tutorial, we cover the basics of principal component analysis (PCA), including the procedure for deciding how many components should be retained in a model and how to plot the relevant graphs. First, we load the appropriate data and packages into the environment. The primary R function used in this tutorial, prcomp, belongs to the stats package, so it is loaded automatically. We first take a glance at the number of variables and the correlation matrix of the data:
library(faraway)
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
library(lars)
## Loaded lars 1.2
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
data("USArrests")
dim(USArrests)
## [1] 50 4
str(USArrests)
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
options(digits=3)
p<-ncol(USArrests); p # No. of variables
## [1] 4
R<-cor(USArrests); R # Display the correlation matrix
## Murder Assault UrbanPop Rape
## Murder 1.0000 0.802 0.0696 0.564
## Assault 0.8019 1.000 0.2589 0.665
## UrbanPop 0.0696 0.259 1.0000 0.411
## Rape 0.5636 0.665 0.4113 1.000
\(\bullet~~\) Now, the first step of a Principal Component Analysis is to center and scale the data using the prcomp command. This command is both convenient and useful: it subtracts each variable's mean from every observation and then divides the result by that variable's standard deviation. We can then use the print and summary commands to see the rotation (loadings), the standard deviations, and the proportion of variance each component contributes to the dataset, and use fviz_eig from the factoextra package to draw a nicely formatted scree plot.
USArrests <- as.data.frame(USArrests)
arrests.pca <- prcomp(USArrests,center = TRUE,scale. = TRUE)
names(arrests.pca)
## [1] "sdev" "rotation" "center" "scale" "x"
print(arrests.pca)
## Standard deviations (1, .., p=4):
## [1] 1.575 0.995 0.597 0.416
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Murder -0.536 0.418 -0.341 0.649
## Assault -0.583 0.188 -0.268 -0.743
## UrbanPop -0.278 -0.873 -0.378 0.134
## Rape -0.543 -0.167 0.818 0.089
summary(arrests.pca)
## Importance of components%s:
## PC1 PC2 PC3 PC4
## Standard deviation 1.57 0.995 0.5971 0.4164
## Proportion of Variance 0.62 0.247 0.0891 0.0434
## Cumulative Proportion 0.62 0.868 0.9566 1.0000
fviz_eig(arrests.pca) # Visualization of variance explained by each component
\(\bullet~~\) In the summary output, the line Proportion of Variance shows what percentage of the variance is explained by each component. Here the first component explains \(62\%\) of the variability, the second component explains \(24.7\%\) of the variance, and so on. The line Cumulative Proportion shows the percentage of variance explained by the first k components. Specifically, the PC3 column gives the value 0.9566, which means that if we include the first 3 components, then \(95.66\%\) of the variance is explained.
\(\bullet~~\) For edification only, suppose the model has n predictors. The purpose of PCA is to choose the minimum number of components k that explain a desired fraction of the total variation: \[\sum_{i=1}^k \lambda_i \ge (1-\alpha)\sum_{i=1}^n \lambda_i,\] where \(\lambda_i\) is the variance (eigenvalue) of the i-th component and \(\alpha\) could take 0.05, 0.01, etc.
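\(\bullet~~\) As a quick illustration of this rule (a minimal sketch using the arrests.pca object fitted above; the 5% threshold is just an example value), we can compute the eigenvalues from the component standard deviations and find the smallest k meeting the criterion:
# Eigenvalues are the squared component standard deviations
lambda <- arrests.pca$sdev^2
cum.prop <- cumsum(lambda)/sum(lambda); cum.prop # Cumulative proportion of variance
alpha <- 0.05
which(cum.prop >= 1 - alpha)[1] # Smallest k explaining at least (1 - alpha) of the variance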
\(\bullet~~\) We can install the factoextra package to improve the visualization of the scree plot (as shown above), the variable plot, the individual plot and the biplot, which we demonstrate below.
# Variable plot
fviz_pca_var(arrests.pca,
col.var = "contrib", # Control variable color using their contributions to the PC
gradient.cols = c("#70f6ff", "#00AFBB", "#ffd224",
"#d8ac00", "#FC4E07", "#a73203"),
repel = TRUE, # Avoid text overlapping
ggtheme = theme_minimal()
)
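# Individual plot (a minimal sketch; coloring individuals by cos2, their quality of
# representation on the first two components, is one illustrative choice of arguments)
fviz_pca_ind(arrests.pca,
col.ind = "cos2", # Control individual color using their quality of representation
gradient.cols = c("#70f6ff", "#00AFBB", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
ggtheme = theme_minimal()
)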
# Biplot
fviz_pca_biplot(arrests.pca, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
\(\bullet~~\) Interpretation of the variable plot: the variable plot shows the coordinates of each variable on the first two principal components, with the color gradient indicating the intensity of each variable's contribution to those components. As the graph suggests, the variables Murder, Assault and Rape lie closer to each other than to the variable UrbanPop, indicating that the former three variables are more similar to one another.
\(\bullet~~\) Interpretation of the biplot: the biplot shows the coordinates of both the observations and the variables on the first two components, based on the scores and the rotation (loadings). Again, a shorter distance between two observations indicates a higher similarity between them.
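\(\bullet~~\) To see the coordinates behind the individual points in the biplot (a minimal check using the arrests.pca object from above), we can inspect the score matrix x directly:
head(arrests.pca$x[,1:2]) # Scores of the first few states on PC1 and PC2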
\(\bullet~~\) The coordinates of the individuals are simply the principal component scores, while the x and y coordinates of the variables (corresponding to the first and second principal components) can be computed by extracting the rotation matrix and the standard deviation of each component and multiplying them accordingly. For example, the x coordinate of each variable arrow is the product of its loading on PC1 and the standard deviation of PC1. We accomplish this with the following commands:
# Compute the coordinate of variables
func_varcoord <- function(rotation, comp.std){
rotation * comp.std} # First create a function computing the multiplication of rotation and component standard deviation
rotation <- arrests.pca$rotation
std <- arrests.pca$sdev
varcoord <- t(apply(rotation, 1, func_varcoord, std))
varcoord[,1:2] # Compute only PC1 and PC2
## PC1 PC2
## Murder -0.844 0.416
## Assault -0.918 0.187
## UrbanPop -0.438 -0.868
## Rape -0.856 -0.166
\(\bullet~~\) The above table gives the x coordinates (PC1 column) and y coordinates (PC2 column) corresponding to Murder, Assault, UrbanPop and Rape, respectively. You can double check that these are indeed the coordinates shown in the variable plot, for example with the get_pca_var function from factoextra, as sketched below.
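\(\bullet~~\) A minimal verification sketch (assuming the arrests.pca object fitted above); the coordinates returned by get_pca_var should match the table computed by hand:
pca.var <- get_pca_var(arrests.pca) # Extract variable results from the PCA
pca.var$coord[,1:2] # Should match the PC1 and PC2 columns of varcoord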
\(\bullet~~\) After the first part of the tutorial, we have gained a basic understanding of how PCA works and which arguments are frequently used. In this second part, we apply the concept of PCA to regression, that is, principal component regression (PCR). This time we use the dataset seatpos in the package faraway. The data set has 38 observations on 9 variables. We fit the PCR model using hipcenter as the response variable and Age, Weight, HtShoes, Ht, Seated, Arm, Thigh and Leg as predictors. Our target is to determine the number of components needed in our model.
\(\bullet~~\) Unlike the first part, we now evaluate the model using K-fold cross-validation, which requires separating the dataset into two parts: training data and test data. Since the data set has 38 observations, we take K = 10 in this example, so the test data comprise \(38\times \frac{1}{10} \approx 4\) observations and the training data comprise the remaining \(38-4=34\) observations; the 10-fold cross-validation is then carried out within the training data.
Info! For more background on K-fold cross-validation, please consult: https://www.coursera.org/learn/python-machine-learning/lecture/Vm0Ie/cross-validation. (If you are also a fan of Python, this is a perfect fit.)
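\(\bullet~~\) Conceptually, K-fold cross-validation assigns each training observation to one of K folds, fits the model K times leaving one fold out each time, and averages the prediction error. The pcr call below does this internally via validation='CV' and segments=10, but a minimal sketch of building such fold assignments by hand (the seed and object names here are for illustration only) looks like this:
# Randomly assign each of the 34 training observations to one of 10 folds
set.seed(1)
K <- 10
n.train <- 34
folds <- sample(rep(1:K, length.out = n.train))
table(folds) # Fold sizes are as equal as possible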
library(faraway)
data(seatpos)
dim(seatpos)
## [1] 38 9
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187 185 95.2 36.1 45.3 41.3 -206.3
## 2 31 175 168 166 83.8 32.9 36.5 35.9 -178.2
## 3 23 100 154 152 82.9 26.0 36.6 31.0 -71.7
## 4 19 185 190 187 97.3 37.4 44.1 41.0 -257.7
## 5 23 159 178 174 93.9 29.5 40.1 36.9 -173.2
## 6 47 170 179 177 92.4 36.0 43.2 37.4 -185.2
# Initialize
predictor_seatpos <- seatpos[,-9] # Take out the response column
R<-cor(predictor_seatpos); R # Display the correlation matrix
## Age Weight HtShoes Ht Seated Arm Thigh Leg
## Age 1.0000 0.0807 -0.0793 -0.0901 -0.170 0.360 0.0913 -0.0423
## Weight 0.0807 1.0000 0.8282 0.8285 0.776 0.698 0.5726 0.7843
## HtShoes -0.0793 0.8282 1.0000 0.9981 0.930 0.752 0.7249 0.9084
## Ht -0.0901 0.8285 0.9981 1.0000 0.928 0.752 0.7350 0.9098
## Seated -0.1702 0.7756 0.9297 0.9282 1.000 0.625 0.6071 0.8119
## Arm 0.3595 0.6976 0.7520 0.7521 0.625 1.000 0.6711 0.7538
## Thigh 0.0913 0.5726 0.7249 0.7350 0.607 0.671 1.0000 0.6495
## Leg -0.0423 0.7843 0.9084 0.9098 0.812 0.754 0.6495 1.0000
seatpos.pca <- prcomp(predictor_seatpos,center = TRUE,scale. = TRUE)
eig.val <- get_eigenvalue(seatpos.pca); eig.val # Eigenvalue in PCA is the variance
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.67319 70.9148 70.9
## Dim.2 1.23679 15.4598 86.4
## Dim.3 0.46374 5.7968 92.2
## Dim.4 0.24096 3.0120 95.2
## Dim.5 0.19422 2.4277 97.6
## Dim.6 0.13917 1.7397 99.4
## Dim.7 0.05034 0.6293 100.0
## Dim.8 0.00159 0.0199 100.0
summary(seatpos.pca) # Summary gives standard deviation
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.382 1.112 0.681 0.4909 0.4407 0.3731 0.22438
## Proportion of Variance 0.709 0.155 0.058 0.0301 0.0243 0.0174 0.00629
## Cumulative Proportion 0.709 0.864 0.922 0.9518 0.9761 0.9935 0.99980
## PC8
## Standard deviation 0.0399
## Proportion of Variance 0.0002
## Cumulative Proportion 1.0000
fviz_eig(seatpos.pca)
\(\bullet~~\) The above gives a visualization of the percentage of variance explained by each component. Note that the get_eigenvalue function reports the variances (eigenvalues) while the summary command reports the standard deviations; the two are equivalent for the analysis, since each eigenvalue is the square of the corresponding standard deviation.
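\(\bullet~~\) A quick check of this equivalence (using the seatpos.pca object fitted above):
seatpos.pca$sdev^2 # Squared standard deviations reproduce the eigenvalues from get_eigenvalue
cumsum(seatpos.pca$sdev^2)/sum(seatpos.pca$sdev^2) # Matches the cumulative proportion in the summary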
\(\bullet~~\) At first sight, if we want to capture 95% of the variance of the data, we should include the first 4 components. Now let's see whether PCR produces a similar result.
set.seed(12) # reproducibility
test.index = sample(seq_len(nrow(seatpos)), size = 4) # Set aside approximately 10% of the observations as test data, leaving 90% for training
test.data = seatpos[test.index,]
train.data = seatpos[-test.index,] # The remaining obs are train data
# Perform principal component regression to determine the number of components needed
pcr.fit <- pcr(hipcenter~Age + Weight + HtShoes + Ht + Seated + Arm +Thigh + Leg, data = train.data, ncomp =8, validation='CV', segments=10) # Total 8 components
# rmse: root mean squared error between predicted and observed values
rmse = function(x,y) {sqrt(mean((x-y)^2))}
# Compute the cross-validated RMSE of prediction (aggregated across the folds), the key criterion for the PCR model
rmseCV <- RMSEP(pcr.fit, estimate ='CV'); rmseCV
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## 56.99 42.25 42.31 37.33 38.43
## 5 comps 6 comps 7 comps 8 comps
## 39.37 39.86 38.12 38.65
plot(rmseCV, xaxt="n", main='Predicted RMSE by number of components')
axis(1, at=c(0,1,2,3,4,5,6,7,8), labels = c(0,1,2,3,4,5,6,7,8)) # Customize x-axis
numcomp.pcr <- which.min(rmseCV$val); (as.numeric(numcomp.pcr)-1) # Number of components to minimize RMSE
## [1] 3
yfit <- predict(pcr.fit, test.data, ncomp=(as.numeric(numcomp.pcr)-1)); yfit
## , , 3 comps
##
## hipcenter
## 3 -100.8
## 31 -99.7
## 34 -98.8
## 10 -122.0
rmse.PCR <- rmse(test.data$hipcenter, yfit); rmse.PCR # RMSE on the held-out test data
## [1] 39.1
\(\bullet~~\) We first use the set.seed command to ensure reproducibility. Then we divide the dataset into two parts: test.data and train.data. Finally, we fit the PCR model with validation='CV' and segments=10 to request 10-fold cross-validation.
\(\bullet~~\) The code chunk gives two main results: we should include 3 components to obtain the lowest cross-validated RMSE (the best predictive model), and the RMSE on the held-out test data is 39.1. Therefore, in this case, PCR produces a slightly different result from the 95%-variance rule, but both PCA and PCR are efficient ways to determine the desired model.
\(\bullet~~\) In particular, based on PCR, our final predictive model is \(hipcenter= \beta_0+\beta_1 pc_1+\beta_2 pc_2+\beta_3 pc_3\). (Note: this yields a different model than the 4-component choice suggested by PCA alone, but such differences are the very nature of cross-validation.)
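\(\bullet~~\) To inspect the fitted coefficients (a minimal sketch using the pcr.fit and train.data objects from above; the names S, scores3 and pc1-pc3 are introduced here only for illustration), the component-level coefficients \(\beta_1,\beta_2,\beta_3\) can be recovered by regressing the response on the first three score columns, and coef from the pls package expresses the same fit on the original predictor scale:
# Component-level coefficients: regress hipcenter on the first three PC scores
S <- unclass(pcr.fit$scores) # Score matrix for the training data
scores3 <- data.frame(pc1 = S[,1], pc2 = S[,2], pc3 = S[,3], hipcenter = train.data$hipcenter)
lm(hipcenter ~ pc1 + pc2 + pc3, data = scores3)
# Equivalent fit expressed as coefficients on the original predictors
coef(pcr.fit, ncomp = 3, intercept = TRUE)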
Principal Component Analysis in R: prcomp vs princomp, by kassambara (2017): http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
University of Michigan, Applied Machine Learning in Python (Coursera), Cross-validation: https://www.coursera.org/learn/python-machine-learning/lecture/Vm0Ie/cross-validation
Interpretation of biplots in principal components analysis (Cross Validated): https://stats.stackexchange.com/questions/2038/interpretation-of-biplots-in-principal-components-analysis