Standardization and normalization are both feature scaling techniques widely used in machine learning. The two terms are often used interchangeably, but there are significant differences between them. How do they differ from each other? First, let’s look at what feature scaling means.
Feature Scaling
Feature scaling is a technique for adjusting the range of a dataset’s features to a common scale. The goal is to ensure that each feature contributes equally to training by bringing all features onto the same scale without distorting the differences between values within each feature. Two of the most common methods of feature scaling are normalization and standardization.
From the graphs, it can be seen that scaling changes the range of the values and, with it, the distances between the corresponding data points. Algorithms that rely on distance calculations, like k-nearest neighbors (KNN) and k-means clustering, can measure similarity more fairly because all features contribute equally to the distance computations. In the unscaled data, the horizontal distances are far larger than the vertical ones, so the horizontal feature dominates the distance and appears more relevant than it really is.
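To make the point concrete, here is a minimal NumPy sketch with made-up numbers: two hypothetical houses described by area and bedroom count. The feature ranges used for scaling are assumptions chosen purely for illustration.

import numpy as np

# Two hypothetical houses: [area in square feet, number of bedrooms]
a = np.array([2000.0, 3.0])
b = np.array([2500.0, 4.0])

# Unscaled Euclidean distance: the area term dominates completely
print(np.linalg.norm(a - b))  # ~500.0, the bedroom difference barely registers

# After min-max scaling each feature to [0, 1] (assuming areas span
# 500-3000 sq ft and bedrooms span 1-5 across the full dataset),
# both features contribute comparably to the distance
a_scaled = np.array([(2000 - 500) / (3000 - 500), (3 - 1) / (5 - 1)])
b_scaled = np.array([(2500 - 500) / (3000 - 500), (4 - 1) / (5 - 1)])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.32, a much more balanced comparison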
Normalization
Normalization simply transforms a numerical feature so that its values fall within a specific range, typically 0 to 1. You perform this transformation when you want your values to be on the same scale. For example, if we have house prices ranging from $50,000 to $1,000,000, normalization would rescale these values to fit between 0 and 1, which shrinks the magnitude of the values. The most common form of normalization is min-max normalization.
This transformation is often suitable for algorithms that are sensitive to the magnitude (how big or small) of values and when the data does not necessarily follow a Gaussian distribution. Keep in mind that outliers strongly affect min-max normalization: a single extreme value stretches the range and squashes the remaining values into a narrow band.
Intuitively, normalization is expected to work better with algorithms such as K-Nearest Neighbors, K-Means Clustering, and Neural Networks. (Tree-based ensemble algorithms like XGBoost split on feature thresholds rather than distances, so they are largely insensitive to feature scaling.)
x_normalized = (x - x.min()) / (x.max() - x.min())
From the graph, it is evident that only the scale of the data changes, from 0–100 to 0–1; the shape of the distribution stays the same.
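As a quick sketch, the min-max formula above can be applied by hand with NumPy or with scikit-learn’s MinMaxScaler; the house prices below are hypothetical values matching the earlier example.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical house prices in dollars
prices = np.array([[50_000.0], [250_000.0], [600_000.0], [1_000_000.0]])

# Manual min-max normalization, matching the formula above
manual = (prices - prices.min()) / (prices.max() - prices.min())

# The same transformation with scikit-learn's MinMaxScaler
scaled = MinMaxScaler().fit_transform(prices)  # default feature_range=(0, 1)

print(manual.ravel())  # [0.         0.21052632 0.57894737 1.        ]
print(scaled.ravel())  # identical values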
Standardization
Like normalization, standardization rescales the data; however, instead of squeezing values into a fixed range, it shifts and scales them so that they have a mean of 0 and a standard deviation of 1. Standardization is also known as z-score normalization.
Standardization is more effective when the data follows a Gaussian distribution, or when the algorithm you are using assumes one or benefits from zero-centered inputs, such as support vector machines, linear regression, and principal component analysis. By standardizing data, you’re not only centering it around zero but also scaling it by the standard deviation, which can make the data more “normal” in statistical terms.
Compared with normalization, standardization is less sensitive to outliers.
z = (x - x.mean()) / x.std()
After standardization, we see from the graphs that only the scale and center of the data shift. Algorithms trained with gradient descent (like linear regression and logistic regression) converge faster because each feature contributes comparably to the gradient updates, and variance-based methods such as principal component analysis (PCA) are no longer dominated by the feature with the largest scale.
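Here is a small sketch of the z-score formula above, computed manually and with scikit-learn’s StandardScaler; the scores used are made-up values for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical exam scores on a 0-100 scale
scores = np.array([[55.0], [70.0], [80.0], [95.0]])

# Manual z-scores, matching the formula above (population std, ddof=0)
z_manual = (scores - scores.mean()) / scores.std()

# The same transformation with scikit-learn's StandardScaler
z_sklearn = StandardScaler().fit_transform(scores)

print(z_manual.ravel())                 # roughly [-1.37 -0.34  0.34  1.37]
print(z_sklearn.ravel())                # identical values
print(z_manual.mean(), z_manual.std())  # ~0.0 and 1.0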
Note: Contrary to popular belief, standardization does not transform the data into a normal/Gaussian distribution. To change the shape of the distribution, transformations such as the logarithm or square/cube roots can be applied to the data.
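A brief sketch of this note, using a hypothetical right-skewed income feature: standardization recenters and rescales the values but leaves the skewed shape intact, while a log transform actually reshapes the distribution.

import numpy as np

# A hypothetical right-skewed feature (e.g. yearly incomes in dollars)
incomes = np.array([20_000, 35_000, 40_000, 60_000, 1_500_000], dtype=float)

# Standardization: the distribution is recentered and rescaled,
# but the extreme value remains far from the rest
z = (incomes - incomes.mean()) / incomes.std()
print(np.round(z, 2))       # roughly [-0.53 -0.51 -0.5  -0.46  2.  ]

# Log transform: the shape itself changes, pulling the values
# much closer together in relative terms
logged = np.log(incomes)
print(np.round(logged, 2))  # roughly [ 9.9  10.46 10.6  11.   14.22]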
Advantages of Feature Scaling
1. Speeds up Gradient Descent: Scaling helps gradient descent converge to the minimum more quickly by ensuring that all features contribute equally, avoiding skewed gradients due to features with larger scales (see the sketch after this list).
2. Equal Importance to All Features: Ensures no single feature dominates in distance-based algorithms like SVM and KNN, allowing all features to have an equal impact on the model’s decisions.
3. Improves Algorithm Performance: Algorithms that assume the data is normally distributed or that all features are on the same scale, such as PCA, perform better when these conditions are met.
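As a rough illustration of these three points, the sketch below uses synthetic data in which one feature has been deliberately inflated, then compares logistic regression iteration counts and KNN accuracy with and without standardization. The dataset and parameters are arbitrary choices, so the exact numbers will vary.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in which one feature is blown up to a much larger scale
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1_000  # this column now dominates distances and gradients

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Point 1: gradient-based training usually needs fewer iterations after scaling
lr_raw = LogisticRegression(max_iter=10_000).fit(X_train, y_train)
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000)).fit(X_train, y_train)
print("iterations, unscaled:", lr_raw.n_iter_[0])
print("iterations, scaled:  ", lr_scaled.named_steps["logisticregression"].n_iter_[0])

# Points 2 and 3: distance-based KNN usually scores better once all
# features contribute equally to the distance computation
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)
print("KNN accuracy, unscaled:", knn_raw.score(X_test, y_test))
print("KNN accuracy, scaled:  ", knn_scaled.score(X_test, y_test))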
Ultimately, choosing between normalization and standardization depends on factors such as the nature and distribution of the data, the presence of outliers, and the range of scale you need. In practice, it is often best to try both and keep the one that works better.