Demystifying Principal Component Analysis (PCA): A Beginner's Guide with Intuitive Examples & Illustrations

In this post, let us understand

  • What is Principal Component Analysis (PCA)
  • When to use it and what are the advantages
  • How to perform PCA in Python with an example


What is Principal Component Analysis (PCA)?


Principal Component Analysis is an unsupervised data analysis technique. It is used for dimensionality reduction. Okay, now what is dimensionality reduction?

In simple terms, dimensionality reduction refers to reducing the number of variables. But if we reduce the number of variables, don’t we lose the information as well?

Yes, we do lose some information. If we simply eliminate variables (directly dropping some of them), then we may lose a significant amount of information. But if, instead, we create new variables from the existing variables (i.e. feature extraction), then we may not lose much of the information.

In PCA, the objective is to reduce the variables in such a way that we retain as much information as possible. Okay, so how do we do it?

Simple example to illustrate PCA


Well, imagine we have a dataset which contains data on ten variables (x1 to x10) for 100 observations. The dataset looks something like this:

Dataset - ten variables (x1 to x10) and 100 observations

Now, we have to reduce this dataset into three variables without losing much information. It will look something like this:

First three principal components
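As a quick illustration of this reduction, here is a minimal sketch (using randomly generated stand-in data in place of the dataset shown above, and sklearn's PCA) that compresses 10 variables down to 3 principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 100 observations of 10 variables (x1 to x10)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))

# Reduce the 10 variables to 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (100, 10) -> original dataset
print(X_reduced.shape)  # (100, 3)  -> first three principal components
```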

That doesn't mean that there will be only three principal components. In fact, if there are 10 variables, there will be 10 principal components. But if we are going to use all 10 principal components, then what is the use of performing PCA? We could just as well use the 10 original variables directly.

How many Principal Components should we retain?


The more information (or variance) in the data the initial principal components explain, the better, because we can then retain fewer of them.

Let us say you want to retain at least 80% of the information present in the data; how many PCs would you need?

If the total variance (or information) present in the data is 100% (or 1), then using the eigenvalues we can find out how much of the information is explained by each of the PCs.

In the following graph, you can see that the first Principal Component (PC) accounts for 70%, the second PC accounts for 20%, and so on. The variance explained declines with each successive component. If we retain the first two PCs, then the cumulative information retained is 70% + 20% = 90%, which meets our 80% criterion.

PCs and explained variance - Scree plot
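To make this concrete, here is a small sketch (with made-up eigenvalues chosen to match the 70%/20% example above) showing how the explained and cumulative variance are obtained from the eigenvalues:

```python
import numpy as np

# Hypothetical eigenvalues of the covariance matrix (one per PC)
eigenvalues = np.array([7.0, 2.0, 0.5, 0.3, 0.2])

# Proportion of total variance explained by each PC
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

print(explained)   # [0.7  0.2  0.05 0.03 0.02]
print(cumulative)  # [0.7  0.9  0.95 0.98 1.  ]

# Number of PCs needed to retain at least 80% of the variance
n_pcs = np.argmax(cumulative >= 0.80) + 1
print(n_pcs)       # 2
```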


How are Principal Component scores calculated?


Principal Component scores are obtained by multiplying the PCA loadings (highlighted in yellow) with the corresponding x values and summing the products. Hence each principal component is a linear combination of the observed variables.

Calculating First PC scores

Calculating Second PC scores


PCA scores

Instead of the original data, we can now use the PCA scores for further analysis, such as building a regression or classification model.
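As a sketch of that calculation (the loadings and data below are made-up numbers for two variables and two PCs), each score is simply the dot product of an observation's standardized values with the loading vector of that PC:

```python
import numpy as np

# Standardized observations (rows) for two variables x1, x2
X_std = np.array([[ 0.5, -1.2],
                  [-0.3,  0.8],
                  [ 1.1,  0.4]])

# Hypothetical PCA loadings: one column per principal component
loadings = np.array([[ 0.71,  0.71],   # weights of x1 on PC1, PC2
                     [ 0.71, -0.71]])  # weights of x2 on PC1, PC2

# Each PC score is a linear combination of the observed variables
scores = X_std @ loadings
print(scores)  # column 0 = PC1 scores, column 1 = PC2 scores
```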

How are Principal Components generated?


Imagine our dataset contains only two variables, with the green dots representing the observations. The first PC is the direction along which the data varies the most, so it retains as much information as possible.

First PC

The second PC is perpendicular to the first PC and tries to explain the maximum remaining information.

Second PC
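Here is a small sketch (with simulated 2D data) suggesting how sklearn's PCA recovers these two directions and that they are indeed perpendicular:

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated 2D data with most of its variance along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

pca = PCA(n_components=2).fit(X)

pc1, pc2 = pca.components_             # each row is a unit-length PC direction
print(np.dot(pc1, pc2))                # ~0.0 -> the two PCs are perpendicular
print(pca.explained_variance_ratio_)   # PC1 explains most of the variance
```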


What are the advantages of PCA?

  • It is a popular method for dimensionality reduction
  • It helps to overcome the problem of multicollinearity
  • It is useful when there are too many variables and you don't know which ones to drop

Disadvantages

  • The major limitation is the assumption of linearity.
  • It is suitable for quantitative data and not recommended for qualitative (categorical) data.
  • Interpreting PCs is difficult compared to the original variables.

How to perform PCA in Python with an example


Let us see how to perform PCA with sklearn using the iris dataset.


Since PCA is affected by the units (scale) of the features, we have to standardize the features before running PCA.
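A minimal sketch of these two steps, loading the iris dataset and standardizing it with StandardScaler (the variable names here are our own):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the iris dataset: 150 observations, 4 features
X, y = load_iris(return_X_y=True)

# Standardize the features so each has mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
```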


The number of components can be left unspecified when running PCA for the first time, since we do not yet know how much variance each of the PCs explains.
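Continuing the sketch above:

```python
from sklearn.decomposition import PCA

# Fit PCA without specifying n_components, to inspect all of them first
pca = PCA()
pca.fit(X_std)

# Proportion of variance explained by each principal component
print(pca.explained_variance_ratio_)
```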


In our case, about 73% of the information is explained by the first PC, while the second PC explains about 23% of the information.
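Since the first two PCs together retain roughly 96% of the information, we can refit PCA with two components and obtain the PC scores for further analysis (a sketch, continuing from the standardized data above):

```python
from sklearn.decomposition import PCA

# Keep only the first two principal components (X_std from the earlier step)
pca_2 = PCA(n_components=2)
X_pca = pca_2.fit_transform(X_std)

print(X_pca.shape)  # (150, 2) -> PC scores ready for a regression/classification model
```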



Conclusion


In this post, we have explored
  • What PCA, PCA loadings and PC scores are
  • How to perform PCA using sklearn with an example