Principal Component Analysis
Introduction
PCA summarizes a dataset with a small number of representative variables that contain the most information, by finding a set of variables that have maximal variance and are mutually uncorrelated.
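The first principal component (PC) is the normalized linear combination of the features that has the largest variance:
$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$
We refer to $\phi_{11}, \dots, \phi_{p1}$ as the loadings of the first PC.
The second PC is the normalized linear combination of the features that has the largest variance among all combinations uncorrelated with $Z_1$:
$$Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \dots + \phi_{p2} X_p$$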
Uncorrelatedness is useful because each additional component then provides information that the earlier components have not already captured.
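The definitions above can be sketched with NumPy. This is a minimal illustration, not a reference implementation: it assumes synthetic random data, centers the features without scaling them, and uses the singular value decomposition to obtain the loadings. The function name `pca` and all variable names are made up for this example.

```python
import numpy as np

def pca(X):
    """Return (loadings, scores) for X of shape (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)  # center each feature (no scaling in this sketch)
    # Rows of Vt are unit-length loading vectors, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt.T          # column k holds the loadings of PC k+1
    scores = Xc @ loadings   # column k holds the scores of PC k+1
    return loadings, scores

# Synthetic correlated features (assumption: illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

loadings, scores = pca(X)
score_cov = np.cov(scores, rowvar=False)  # covariance matrix of the PC scores
```

If the sketch is right, `score_cov` is diagonal up to rounding error (the PCs are mutually uncorrelated) and its diagonal entries, the PC variances, appear in decreasing order.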
- Blue names represent the scores of the first two PCs.
- Red names represent the loading vectors of the first two PCs. The top axis shows the loadings on the first PC and the right axis shows the loadings on the second PC.
Changing the signs of a PC's loadings (and hence its scores) will flip or rotate the biplot, but it does not change the PCA result, since variance is not affected by the sign.
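The sign invariance can be checked numerically. The sketch below assumes synthetic data and illustrative variable names (`phi`, `z`); it compares the variance of the first PC's scores before and after negating the loading vector.

```python
import numpy as np

# Synthetic data (assumption: illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi = Vt[0]              # loadings of the first PC
z = Xc @ phi             # scores of the first PC
z_flipped = Xc @ (-phi)  # scores after flipping the sign of the loadings

var_original = z.var(ddof=1)
var_flipped = z_flipped.var(ddof=1)
```

The flipped scores are the negatives of the original scores, so the points are mirrored along that axis while `var_original` and `var_flipped` agree.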
Example
The word UrbanPop is positioned at (0.25, 0.75), which means that its loading on the first PC is 0.25 and its loading on the second PC is 0.75.
- Overall, we see that crime-related variables are located close to each other, and UrbanPop is located far from them.
- States on the far right have high crime rates, while states on the far left have low crime rates.
To understand the strength of each component, we can compute the proportion of variance explained (PVE), which shows how much of the total variance each component accounts for.
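A minimal sketch of the PVE computation, assuming centered (unscaled) synthetic data. The function name `proportion_variance_explained` is made up for this example. The variance of each PC is obtained from the squared singular values of the centered data matrix.

```python
import numpy as np

def proportion_variance_explained(X):
    """PVE of each PC for X of shape (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    component_var = s**2 / (len(X) - 1)      # variance of each PC
    return component_var / component_var.sum()

# Synthetic features with very different spreads (assumption: illustrative only).
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pve = proportion_variance_explained(X)
cumulative_pve = np.cumsum(pve)  # variance explained by the first k PCs together
```

The entries of `pve` sum to 1 and are non-increasing. Plotting `pve` against the component index gives the scree plot.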
A scree plot is a line plot of PVE against the number of PCs. The elbow point is the point where the curve starts to flatten out, which suggests the number of PCs to keep.
In the image below, the conclusion is to use 3 PCs.