Why it makes a difference how you standardize the training and test set

Lasse Schmidt
Published in Analytics Vidhya
3 min read · Oct 25, 2021



In this blog post I want to briefly show why it is important to scale your train and test data correctly. Most machine learning practitioners automatically avoid the mistake of standardizing the test data with anything other than the scaler learned from the trainset, but I think many do not know exactly why. Here, I will give a concrete example of why you need to use the scaler from the trainset for the testset as well.

Let’s first create some dummy data. For example, we can assume the following are 3 different users, described by three variables. We also create the targets, which we can think of as the cluster each user belongs to:
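A minimal sketch of such data could look like this. The concrete numbers and the names `train_data`, `train_target` and `test_data` are assumptions for illustration, not the original values. Test users 1 and 3 are copies of train users 1 and 3, which is the setup under which expecting the labels 2 and 0 for them makes sense.

```python
import numpy as np

# Assumed dummy data: one row per user, one column per (made-up) variable.
train_data = np.array([[1.0, 5.0, 10.0],
                       [2.0, 5.0, 30.0],
                       [3.0, 2.0, 20.0]])
train_target = np.array([2, 1, 0])  # assumed cluster label of each train user

# Test users 1 and 3 are identical to train users 1 and 3; user 2 is new.
test_data = np.array([[1.0, 5.0, 10.0],
                      [10.0, 20.0, 60.0],
                      [3.0, 2.0, 20.0]])

print(train_data)
print(test_data)
```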

You see that in the test_data user 1 and user 3 are exactly the same as in the train data. This is on purpose. Let’s create two standard scalers, meaning we will subtract the mean and divide by the standard deviation.
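As a sketch, the two scalers and the three scaled datasets could be built as follows. The data values and variable names are again my own assumptions; `StandardScaler`, `fit_transform` and `transform` are the regular scikit-learn calls.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed dummy data (not the original values).
train_data = np.array([[1.0, 5.0, 10.0], [2.0, 5.0, 30.0], [3.0, 2.0, 20.0]])
test_data = np.array([[1.0, 5.0, 10.0], [10.0, 20.0, 60.0], [3.0, 2.0, 20.0]])

scaler_train = StandardScaler()
scaler_test = StandardScaler()

# Each fit_transform learns mean and standard deviation from the data it is given.
scaled_train = scaler_train.fit_transform(train_data)
scaled_test_wrong = scaler_test.fit_transform(test_data)

# transform reuses the statistics that scaler_train learned from the trainset.
scaled_test_correct = scaler_train.transform(test_data)
```

The only difference between the last two lines is which statistics are used: the test set's own, or the ones learned from the trainset.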

You can already see the difference between the two approaches. The first two scaled datasets are built with the fit_transform method of the StandardScaler, while the last one uses the “trained” scaler from the trainset to scale the test data. Let’s have a look at the different means and variances:
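A fitted `StandardScaler` exposes what it learned as `mean_`, `var_` and `scale_` (the standard deviation that is actually used for dividing). The sketch below prints them for both scalers, using the same assumed dummy data as before.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed dummy data (not the original values).
train_data = np.array([[1.0, 5.0, 10.0], [2.0, 5.0, 30.0], [3.0, 2.0, 20.0]])
test_data = np.array([[1.0, 5.0, 10.0], [10.0, 20.0, 60.0], [3.0, 2.0, 20.0]])

scaler_train = StandardScaler().fit(train_data)
scaler_test = StandardScaler().fit(test_data)

# Per-column statistics learned from the trainset.
print("train mean:", scaler_train.mean_, "train var:", scaler_train.var_)
# Per-column statistics learned from the testset.
print("test mean: ", scaler_test.mean_, "test var: ", scaler_test.var_)
```

With these numbers the train means are 2, 4 and 20, while the test set's own means are about 4.67, 9 and 30, so the same raw value ends up at a different place depending on which scaler is applied.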

And this is what the data looks like:
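To compare the three versions side by side, a sketch like the following prints them rounded to two decimals (same assumed dummy data and hypothetical variable names as above).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed dummy data (not the original values).
train_data = np.array([[1.0, 5.0, 10.0], [2.0, 5.0, 30.0], [3.0, 2.0, 20.0]])
test_data = np.array([[1.0, 5.0, 10.0], [10.0, 20.0, 60.0], [3.0, 2.0, 20.0]])

scaler_train = StandardScaler()
scaled_train = scaler_train.fit_transform(train_data)
scaled_test_wrong = StandardScaler().fit_transform(test_data)  # test set's own statistics
scaled_test_correct = scaler_train.transform(test_data)        # trainset statistics

print("scaled train:\n", np.round(scaled_train, 2))
print("test, scaled on its own:\n", np.round(scaled_test_wrong, 2))
print("test, scaled with train scaler:\n", np.round(scaled_test_correct, 2))
```

In the last array, users 1 and 3 have exactly the same values as in the scaled trainset, while the new user 2 lands several standard deviations away from the train values.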

At first glance, the correctly scaled test data looks wrong, simply because the numbers seem so far off. However, let’s have a look at what happens when we fit a simple linear regression on our trainset and make predictions on the testsets:
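Here is a sketch of the first case: the model is trained on the scaled trainset, but the testset is scaled with a scaler fitted on the testset itself. Data and names are assumptions; `LinearRegression` is the scikit-learn estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Assumed dummy data (not the original values); test users 1 and 3 copy train users 1 and 3.
train_data = np.array([[1.0, 5.0, 10.0], [2.0, 5.0, 30.0], [3.0, 2.0, 20.0]])
train_target = np.array([2, 1, 0])
test_data = np.array([[1.0, 5.0, 10.0], [10.0, 20.0, 60.0], [3.0, 2.0, 20.0]])

scaled_train = StandardScaler().fit_transform(train_data)
scaled_test_wrong = StandardScaler().fit_transform(test_data)  # refit on the testset

model = LinearRegression().fit(scaled_train, train_target)

# Predictions for the test users after scaling them with their own statistics.
pred_wrong = model.predict(scaled_test_wrong)
print(pred_wrong)
```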

Remember, in our testset, we expect user 1 and user 3 to be classified as 2 and 0 respectively.

But this did not happen at all here. We see that neither the first user nor the third user is classified correctly, even though they are exactly the same as in the trainset. Let’s see what the results are when using the scaler from our trainset:
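The same sketch with one change: the testset is now transformed with the scaler that was fitted on the trainset (again with assumed data and hypothetical names).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Assumed dummy data (not the original values); test users 1 and 3 copy train users 1 and 3.
train_data = np.array([[1.0, 5.0, 10.0], [2.0, 5.0, 30.0], [3.0, 2.0, 20.0]])
train_target = np.array([2, 1, 0])
test_data = np.array([[1.0, 5.0, 10.0], [10.0, 20.0, 60.0], [3.0, 2.0, 20.0]])

scaler_train = StandardScaler()
scaled_train = scaler_train.fit_transform(train_data)
scaled_test_correct = scaler_train.transform(test_data)  # reuse trainset statistics

model = LinearRegression().fit(scaled_train, train_target)

# Users 1 and 3 now get the same inputs the model saw during training.
pred_correct = model.predict(scaled_test_correct)
print(np.round(pred_correct, 2))
```

Because users 1 and 3 are mapped to exactly the points the model was trained on, their predictions match the train targets 2 and 0.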

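This time, the linear model correctly classified user 1 and user 3. So you see what a difference wrongly applied scaling makes. The model’s coefficients were learned on values scaled with the trainset’s statistics, so new data only means the same thing to the model if it is scaled with those same statistics. This is why we use the mean and standard deviation from the trainset to standardize the testset.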

Lasse

Originally published at https://lschmiddey.github.io on October 25th, 2021.
