Data Preprocessing: Transformation

Data preprocessing is an important step before fitting any model, and transformation is one of the steps performed under it.

In this post, with the help of an example, let us explore transformation:

  • Standardization
  • Normalization
  • Log transformation
  • How to transform data in Python

Example data

The example data contains four columns. In fact, the last two columns are derived from the first two. Height (cm) and Height (m) measure the same thing; the only difference is the unit. The same is the case with Weight (g) and Weight (kg).
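Since the original table is not reproduced here, a small stand-in data set can be built as follows. This is only a sketch: the measurements and the column order are assumptions made for illustration, not the values used in the post.

```python
import numpy as np

# Hypothetical measurements (assumed values, not the original data).
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_g = np.array([50000.0, 58000.0, 66000.0, 80000.0, 95000.0])

# The last two columns are derived from the first two by a change of unit.
height_m = height_cm / 100.0
weight_kg = weight_g / 1000.0

# Column order is an assumption.
columns = ["Height (cm)", "Weight (g)", "Height (m)", "Weight (kg)"]
X = np.column_stack([height_cm, weight_g, height_m, weight_kg])

print(columns)
print(X)
```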


1. Standardization


This is the most commonly used transformation. The mean of a column is subtracted from each observation in that column, and the result is then divided by the standard deviation of that column.
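To make the arithmetic concrete, here is a minimal sketch of the same calculation done by hand with NumPy. The numbers are made up, and the population standard deviation (ddof=0) is used, which is an assumption chosen to match what StandardScaler does.

```python
import numpy as np

# Hypothetical two-column data: height in cm, weight in g (assumed values).
X = np.array([
    [150.0, 50000.0],
    [160.0, 58000.0],
    [170.0, 66000.0],
    [180.0, 80000.0],
    [190.0, 95000.0],
])

col_mean = X.mean(axis=0)  # one mean per column
col_std = X.std(axis=0)    # population standard deviation (ddof=0) per column

# Subtract the column mean, then divide by the column standard deviation.
Z = (X - col_mean) / col_std

print(Z)
```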

Using the StandardScaler class from sklearn, let us standardize the four columns of our example data set.
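A sketch of this step is given here. The data values and column order are assumptions made for illustration. Notice that after standardization the two height columns become identical, and so do the two weight columns, because a change of unit disappears once each column is centered and scaled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical example data (assumed values): columns are
# Height (cm), Weight (g), Height (m), Weight (kg).
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_g = np.array([50000.0, 58000.0, 66000.0, 80000.0, 95000.0])
X = np.column_stack([height_cm, weight_g, height_cm / 100.0, weight_g / 1000.0])

scaler = StandardScaler()          # with_mean=True and with_std=True by default
X_std = scaler.fit_transform(X)    # learns column means and stds, then transforms

print(scaler.mean_)                # per-column means learned from X
print(scaler.scale_)               # per-column standard deviations learned from X
print(X_std.mean(axis=0))          # approximately 0 for every column
print(X_std.std(axis=0))           # approximately 1 for every column
```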

If we want to center the data using the mean without dividing by the standard deviation, or divide by the standard deviation without centering, we can set the relevant option (as shown in Out [5]). By default, both with_mean and with_std are set to True.
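The two options can be tried out as in the following sketch, again on made-up numbers rather than the original data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column data (assumed values).
X = np.array([
    [150.0, 50.0],
    [160.0, 58.0],
    [170.0, 66.0],
    [180.0, 80.0],
    [190.0, 95.0],
])

# Center only: subtract the mean, keep the original spread.
centered = StandardScaler(with_std=False).fit_transform(X)

# Scale only: divide by the standard deviation, keep the original location.
scaled = StandardScaler(with_mean=False).fit_transform(X)

print(centered.mean(axis=0), centered.std(axis=0))
print(scaled.mean(axis=0), scaled.std(axis=0))
```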


If we check the mean and standard deviation of the standardized columns, they are 0 and 1 respectively.
If you estimate regression coefficients using standardized features, you can compare the coefficients directly. The larger the absolute value of a coefficient, the greater its predictive power, or its influence on the dependent variable.
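The point about comparing coefficients can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the two features, their scales, and the true coefficients are invented. The second feature has a tiny raw coefficient only because it is measured on a large scale; after standardization its coefficient is the larger one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Two simulated features on very different scales (assumed setup).
x1 = rng.normal(loc=0.0, scale=1.0, size=n)
x2 = rng.normal(loc=0.0, scale=1000.0, size=n)
X = np.column_stack([x1, x2])

# The response depends on both features, plus a little noise.
y = 2.0 * x1 + 0.005 * x2 + rng.normal(scale=0.1, size=n)

# Coefficients on the raw features depend on the units of each feature.
raw_coef = LinearRegression().fit(X, y).coef_

# Coefficients on standardized features are on a common scale.
X_std = StandardScaler().fit_transform(X)
std_coef = LinearRegression().fit(X_std, y).coef_

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", std_coef)
```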

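Standardization is important for methods that assume all features are centered around zero and have variances of the same order, for example: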
  • RBF kernel of Support Vector Machines 
  • L1 and L2 regularizers of linear models
If there are outliers, it is better to use RobustScaler or QuantileTransformer, which are less sensitive to them.
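The effect of an outlier can be seen in the following sketch, which uses a made-up column with one extreme value. RobustScaler, with its default settings, centers on the median and scales by the interquartile range, so the ordinary values keep a sensible spread, whereas StandardScaler squeezes them together.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One hypothetical column with a single large outlier (assumed values).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)  # uses mean and standard deviation
robust = RobustScaler().fit_transform(x)      # uses median and interquartile range

print(standard.ravel())
print(robust.ravel())
```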

2. Normalization


Unlike standardization, normalization is a per-sample (row-wise) transformation, not a per-feature (column-wise) transformation.

This transforms each sample to unit norm using the ‘l1’, ‘l2’, or ‘max’ norm.


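In case of the l1 norm, the sum of the absolute values in each row will be one (as shown in the pic below). In case of the l2 norm, the square root of the sum of the squares of the values in each row will be one. In case of the max norm, the largest absolute value in each row will be one.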
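These three cases can be checked with the Normalizer class from sklearn, as in the sketch below. The input rows are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: each row is one sample (assumed values).
X = np.array([
    [3.0, 4.0],
    [1.0, 1.0],
    [6.0, 8.0],
])

X_l1 = Normalizer(norm="l1").fit_transform(X)    # rows sum (in absolute value) to 1
X_l2 = Normalizer(norm="l2").fit_transform(X)    # rows have Euclidean length 1
X_max = Normalizer(norm="max").fit_transform(X)  # largest absolute value per row is 1

print(np.abs(X_l1).sum(axis=1))
print(np.sqrt((X_l2 ** 2).sum(axis=1)))
print(np.abs(X_max).max(axis=1))
```

Note that the first and third rows end up identical after normalization, since one is just a multiple of the other.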


3. Log transformation


Log transformation is more common in time series data. It also helps to handle outliers when the data is skewed to the right. To apply a log transformation, all values need to be strictly positive (greater than zero).
Log transforming the right-skewed data

This is how we can log transform the data (natural log).
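A minimal sketch of this step with NumPy follows. The right-skewed sample is simulated, and the skewness helper is a hypothetical function written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated right-skewed, strictly positive data (assumed for illustration).
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Natural log transformation; every value must be greater than zero.
x_log = np.log(x)

def skewness(values):
    """Hypothetical helper: sample skewness (third standardized moment)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

print("skewness before:", skewness(x))
print("skewness after: ", skewness(x_log))

# The transformation can be undone with the exponential function.
x_back = np.exp(x_log)
```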

Another transformation is the Box-Cox transformation. In this case too, the input data should be strictly positive.
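One way to apply Box-Cox in Python is the PowerTransformer class from sklearn, sketched below on simulated data. Choosing PowerTransformer here is an assumption; the post does not say which tool it used. By default this class also standardizes the result to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Simulated right-skewed, strictly positive data as a single column (assumed).
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Box-Cox estimates one power parameter (lambda) per column from the data.
pt = PowerTransformer(method="box-cox")  # standardize=True by default
x_bc = pt.fit_transform(x)

print("estimated lambda:", pt.lambdas_)
print("mean after:", x_bc.mean(), "std after:", x_bc.std())
```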

Summary


In this post, we have explored:
  • Standardization
  • Normalization
  • Log transformation
  • How to perform these transformations in Python

If you have any questions or suggestions, feel free to share. I will be very happy to interact with you.