Data Science Simplified: Z-Scores Explained with Examples: A Beginner's Guide

In this post, let us understand the basics of the Z-score. It is useful for seeing where a value sits within the distribution of the data.

Let us start.

Examples:

Dataset A is: 10, 12, 14, 16.

Dataset B is: 10, 200, 350, 600.

Both the mean of B (290) and the standard deviation of B (about 249) are higher than those of A (mean 13, standard deviation about 2.6).

Now let us consider two more datasets.

Dataset C is: 10, 30, 40, 60.

Dataset D is: 31, 33, 36, 40.

The mean of dataset C and dataset D is the same (35).

But what about standard deviation? Are they equal?

No, even though the means of C and D are equal, standard deviations are different.

The standard deviation of dataset C (20.8) is much higher than that of dataset D (3.9).

In summary, to understand how the data are distributed, we need both the mean and standard deviation.

While the mean conveys the central point, the standard deviation tells us the spread of the data.
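The figures quoted for the four datasets can be checked with a few lines of Python. This is a minimal sketch using the standard library; it assumes the sample standard deviation (dividing by n - 1), which is the convention that matches the 20.8 and 3.9 quoted for datasets C and D.

```python
from statistics import mean, stdev

datasets = {
    "A": [10, 12, 14, 16],
    "B": [10, 200, 350, 600],
    "C": [10, 30, 40, 60],
    "D": [31, 33, 36, 40],
}

# stdev() is the sample standard deviation (it divides by n - 1)
summary = {name: (mean(values), stdev(values)) for name, values in datasets.items()}

for name, (m, s) in summary.items():
    print(f"Dataset {name}: mean = {m:.1f}, standard deviation = {s:.1f}")
```

Datasets C and D share the same mean, yet their standard deviations differ widely, which is why both numbers are needed to describe the data.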

Z-score formula

The next concept is the Z-score. What is a Z-score?

In simple words, the Z-score tells us how many standard deviations a value lies from the mean, so it combines both the mean and the standard deviation of the data.

Z = (x - µ) / σ

What is x here? x is a value, µ is the mean, and σ is the standard deviation.

Now let us understand Z-scores with examples.

If the mean is 20 and the standard deviation is 2, then the Z-score for x=20 is (20 - 20) / 2 = 0.

But if the value of x is far above the mean, let us say x is 30, then the Z-score is (30 - 20) / 2 = 5.

Similarly, if the value of x is on the lower side, let us say x is 10, then the Z-score is (10 - 20) / 2 = -5.

The Z-score is positive for values on the right side of the mean and negative for values on the left side.
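The three worked examples above can be reproduced with a small helper. This is only a sketch: the function name `z_score` and its parameter names are chosen here for illustration, and it simply applies Z = (x - µ) / σ.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Mean 20 and standard deviation 2, as in the examples above
for x in (20, 30, 10):
    print(f"x = {x}: Z = {z_score(x, mu=20, sigma=2)}")
```

A value equal to the mean gives 0, a value above the mean gives a positive score, and a value below it gives a negative score.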

A Z-score of 0 lies at the center of the distribution, exactly at the mean.

Assuming a normal distribution, about 99.7% of the data lie between Z-scores of -3 and +3.
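The 99.7% figure can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Between Z = -{k} and Z = +{k}: {share_within(k):.1%}")
```

The same calculation for 1 and 2 standard deviations gives the other two parts of the well-known 68-95-99.7 rule.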

Advantages of Z-score:

To compare different datasets

Suppose there are two students: John and Peter. John scored 80 marks in one exam, while Peter scored 60 in another exam.

Though 80 is greater than 60, it does not necessarily mean that John performed better.

What if the exam that Peter took was tougher?

In such cases, if we know how other students performed in these two exams, we can use this information for better comparison.

The Z-score uses the mean and standard deviation of each exam.

Hence, if we are told that the Z-score for John is 1.5 while Peter's is 2.5, we can conclude that Peter performed better relative to the other students who took his exam.
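To make the comparison concrete, here is a small sketch. The exam means and standard deviations are not given in this post, so the numbers below are hypothetical, picked only so that they reproduce the Z-scores of 1.5 and 2.5.

```python
# Hypothetical exam statistics (not from the post), chosen to give Z = 1.5 and Z = 2.5
exams = {
    "John": {"score": 80, "exam_mean": 65, "exam_sd": 10},
    "Peter": {"score": 60, "exam_mean": 40, "exam_sd": 8},
}

# Z-score of each student relative to their own exam
z_scores = {
    name: (e["score"] - e["exam_mean"]) / e["exam_sd"]
    for name, e in exams.items()
}

best = max(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: Z = {z}")
print(f"Better relative performance: {best}")
```

Even though Peter's raw score is lower, his score sits further above the average of his exam.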

To identify outliers

Usually, a value with a Z-score greater than +3 or lower than -3 is considered an outlier.
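The rule of thumb can be sketched as a short function. The data below are made up for illustration, and the function name `find_outliers` and the default threshold of 3 are choices made for this example. Note that in a very small dataset no value can reach a Z-score of 3, so the rule needs a reasonable number of observations.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - m) / s) > threshold]


# Made-up readings: nineteen values close to 50 and one extreme value
readings = [50, 52, 49, 51, 48, 50, 53, 47, 51, 49,
            50, 52, 48, 51, 49, 50, 52, 48, 50, 120]

outliers = find_outliers(readings)
print(outliers)
```

Because the extreme value also inflates the mean and standard deviation used to compute the scores, this simple check is best treated as a first screen.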

To find out the relative position of a particular value

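Suppose a dataset is normally distributed with a mean of 30 (the standard deviation is taken as 4 here, the value consistent with the areas quoted below). Then for a value of 37, the Z-score is (37 - 30) / 4 = 1.75, and we can find out: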

a) Area under the curve

Using the Z-table, we can find that the area under the curve to the left of x=37 is about 0.96.

That means about 96% of values lie to the left of x=37.

b) Area between the mean and the value

Using the Z-table, we can find out that 46% of values lie between the mean of 30 and x=37.

c) Area beyond the value

Around 4% of the values lie beyond x=37.
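The three areas can also be computed without a Z-table. This is a sketch under one assumption: a standard deviation of 4, which is consistent with the areas of 0.96, 0.46 and 0.04 quoted above. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Mean of 30 is from the text; sigma = 4 is an assumption consistent with the quoted areas
dist = NormalDist(mu=30, sigma=4)
x = 37

z = (x - dist.mean) / dist.stdev   # Z-score of x
area_left = dist.cdf(x)            # a) area to the left of x
area_between = area_left - 0.5     # b) area between the mean and x
area_beyond = 1 - area_left        # c) area beyond x

print(f"Z = {z}")
print(f"a) left of x: {area_left:.2f}")
print(f"b) between mean and x: {area_between:.2f}")
print(f"c) beyond x: {area_beyond:.2f}")
```

Half of a normal distribution lies below the mean, which is why the area between the mean and x is the left-hand area minus 0.5.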

Disadvantages of Z-score:

The Z-score cannot be calculated for nominal data (e.g. city names, zip codes) or ordinal data (e.g. low, medium and high).
Reading Z-scores as percentages (such as the 99.7% rule) assumes a normal distribution. This assumption holds in many situations, but not always.

In summary, the Z-score is a standardized measure of how far a value is from the mean, expressed in standard deviations.

Z-scores can be useful to identify outliers or find the relative position.
