Normalization

Theory:

Normalization eliminates units of measurement, making it easier to compare data collected in different places and on different scales.

Why do we do it?

In machine learning, normalization is generally only required when features have widely varying ranges.

[Figure: The Datasaurus Dozen]

  • Certain machine learning algorithms (such as SVM and KNN) are more sensitive to the scale of the data than others, because they rely on the distances between data points.

  • To avoid this problem, we bring all features to a common scale while preserving the shape of their distributions. This is often referred to as min-max scaling.

    • Suppose we are working with a dataset that has two variables: height and weight, where height is measured in inches and weight is measured in pounds.
    • Even before running any numbers, you can expect the values for weight to span a larger range than the values for height.
    • In everyday terms, height might fall roughly between 65 and 75 inches (my assumption), while weight might fall roughly between 120 and 220 pounds (also my assumption), so without rescaling, weight would dominate any distance calculation (see the sketch after this list).
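
Below is a minimal sketch of min-max scaling applied to this height/weight example; the sample values and the use of NumPy are illustrative assumptions, not part of the original page.

```python
import numpy as np

# Illustrative height (inches) and weight (pounds) samples, matching the assumed ranges above.
data = np.array([
    [65.0, 120.0],
    [68.0, 150.0],
    [71.0, 185.0],
    [75.0, 220.0],
])

# Min-max scaling: rescale each column to the [0, 1] range.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)

print(scaled)  # both columns now lie in [0, 1], so neither feature dominates a distance metric
```

After scaling, a distance-based algorithm such as KNN treats height and weight on an equal footing.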

Methods:

  • Standardization: transforming data into a z-score or t-score, i.e. rescaling it to have a mean of 0 and a standard deviation of 1, via z = (x − μ) / σ (a code sketch of this and feature scaling follows the list).

  • Feature Scaling: rescaling data to have values between 0 and 1, via x' = (x − min) / (max − min).

  • Normalizing Moments: dividing the k-th central moment by σ^k, giving the standardized moment μ_k / σ^k.
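
The sketch below shows the first two methods, standardization and min-max feature scaling, applied to the same illustrative height/weight data; using scikit-learn's StandardScaler and MinMaxScaler is an assumption of this sketch, not something the page prescribes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# The same illustrative height (inches) / weight (pounds) samples as above.
X = np.array([
    [65.0, 120.0],
    [68.0, 150.0],
    [71.0, 185.0],
    [75.0, 220.0],
])

# Standardization: each column is shifted and scaled to mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# Feature scaling: each column is rescaled to lie in [0, 1].
min_max_scaled = MinMaxScaler().fit_transform(X)

print(standardized.mean(axis=0), standardized.std(axis=0))     # approximately [0, 0] and [1, 1]
print(min_max_scaled.min(axis=0), min_max_scaled.max(axis=0))  # [0, 0] and [1, 1]
```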
