SMOTE vs Deep Augmenter — testing the predictive power on imbalanced data

Lasse Schmidt · Published in Analytics Vidhya · 4 min read · Feb 20, 2022


In this blog post I want to test the Deep Learning Augmentation approach against the popular SMOTE approach. For this I again use the credit card fraud dataset. This dataset is highly imbalanced, meaning we have far fewer fraud cases than non-fraud cases. The goal is to train a RandomForest model to predict non-fraud vs fraud cases. For this I will use three approaches:

  • assigning a higher weight to the fraud cases (built into sklearn's RandomForestClassifier)
  • using the SMOTE approach as implemented here
  • using the Deep Augmenter

I already wrote about how to use the Deep Augmenter package here and here. In the credit card fraud dataset we have 199,032 non-fraud cases vs 332 fraud cases. So let's prepare the Deep Augmenter first: loading the data, selecting only the data we want to build fake data for (in this case the fraud cases), splitting it into train vs test sets, and putting them into dataloaders. We then build our model:
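The package itself wraps these steps; since the original gist isn't reproduced here, below is a minimal plain-PyTorch sketch of the equivalent steps, assuming a simple autoencoder with a small latent space (the file path, variable names, and hyperparameters are my assumptions, not the package's API):

```python
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

# Load the data and keep only the class we want to create fake data for (fraud).
df = pd.read_csv('creditcard.csv')
fraud = df[df['Class'] == 1].drop(columns='Class')

# Split into train vs test sets and put them into dataloaders.
train_df, test_df = train_test_split(fraud, test_size=0.2, random_state=42)
train_dl = DataLoader(TensorDataset(torch.tensor(train_df.values, dtype=torch.float32)),
                      batch_size=64, shuffle=True)
test_dl = DataLoader(TensorDataset(torch.tensor(test_df.values, dtype=torch.float32)),
                     batch_size=64)

# A small autoencoder: the encoder compresses each row into a few latent
# factors, the decoder reconstructs the row from them.
class Autoencoder(nn.Module):
    def __init__(self, n_features, latent_dim=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 50), nn.ReLU(),
                                     nn.Linear(50, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 50), nn.ReLU(),
                                     nn.Linear(50, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder(n_features=train_df.shape[1])
# (Training loop with an MSE reconstruction loss over train_dl omitted here.)
```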

After our model is trained, we can use it to create synthetic fraud cases:
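Again as a sketch rather than the package's actual API: one way to generate new rows is to encode the real fraud cases, sample new points in the latent space around them, decode, and add per-column noise controlled by a sigma list (the generate helper below is hypothetical):

```python
import numpy as np

@torch.no_grad()
def generate(model, train_df, n_samples, sigmas):
    # Encode the real fraud cases and estimate the latent mean/std.
    x = torch.tensor(train_df.values, dtype=torch.float32)
    z = model.encoder(x)
    mu, std = z.mean(dim=0), z.std(dim=0)

    # Sample new latent points and decode them back into feature space.
    z_new = mu + std * torch.randn(n_samples, z.shape[1])
    fake = model.decoder(z_new).numpy()

    # Per-column "wiggle room": add Gaussian noise with the given sigmas.
    fake += np.random.randn(*fake.shape) * np.asarray(sigmas)
    return pd.DataFrame(fake, columns=train_df.columns)

# 25% of the real fraud cases' per-column standard deviation.
sigmas = (0.25 * train_df.std()).tolist()
df_fake = generate(model, train_df, n_samples=1000, sigmas=sigmas)
```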

One nice thing here is that we can set the sigma parameter per column. Sigma defines the standard deviation of the noise we add; we pass it as a list, and if we want some columns to have a specific standard deviation, we can simply put it in the list. I used the standard deviation of the fraud cases, but only 25% of it, as this seems to give the data enough “wiggle” room while still keeping the trained relations between the variables (check out my blog post on the latent factors here).

Now we use the SMOTE package to create data. SMOTE just needs the (complete) trainset and target values for this trainset:
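With the imbalanced-learn implementation this is essentially a two-liner (X_train and y_train stand for the full trainset and its labels):

```python
from imblearn.over_sampling import SMOTE

# fit_resample oversamples the minority class until both classes are balanced.
sm = SMOTE(random_state=42)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)
```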

So basically SMOTE creates new entries of the minority class until we have the same number of cases in both classes. We can do this with the Deep Learning Augmenter as well:
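Using the hypothetical generate helper from above, a sketch of the balanced trainset for the Deep Augmenter could look like this (again, variable names are assumed):

```python
# Create as many synthetic fraud rows as needed to balance the classes
# (199,032 non-fraud vs 332 real fraud cases).
n_needed = 199032 - 332
df_fake = generate(model, train_df, n_samples=n_needed, sigmas=sigmas)

# Append the synthetic fraud cases (label 1) to the original trainset.
X_train_deep = pd.concat([X_train, df_fake], ignore_index=True)
y_train_deep = pd.concat([y_train, pd.Series(np.ones(n_needed, dtype=int))],
                         ignore_index=True)
```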

In both cases we now have a total of 199,032 fraud cases, almost all of them simulated. While SMOTE creates this synthetic data in almost no time, the Deep Learning Augmentation takes about a minute.

Train the Random Forests

We want to compare how the built-in class_weight functionality performs vs the new approach vs SMOTE. So we will build three trainsets: the original one, the one with additional data from SMOTE, and the one with additional data from Deep Learning Augmentation.

First, let’s train a model on the original data while using the differences in class occurrences as weights.
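In sklearn this only takes the class_weight argument; setting it to 'balanced' weights each class inversely proportional to its frequency (the other hyperparameters here are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

rf_weighted = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                     random_state=42, n_jobs=-1)
```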

We can also define our Random Forests for the SMOTE and Deep Augmenter approaches:
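These two don't need class weights, since their trainsets are already balanced:

```python
rf_smote = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_deep = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
```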

Let’s run all of these models:
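Each forest is fitted on its respective trainset:

```python
rf_weighted.fit(X_train, y_train)
rf_smote.fit(X_train_smote, y_train_smote)
rf_deep.fit(X_train_deep, y_train_deep)
```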

Then finally, let’s compare the results of the three approaches on the test set.
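The numbers discussed below come from my run; a sketch of how to produce the confusion matrices and classification reports (X_test and y_test assumed):

```python
from sklearn.metrics import classification_report, confusion_matrix

for name, rf in [('class_weight', rf_weighted),
                 ('SMOTE', rf_smote),
                 ('Deep Augmenter', rf_deep)]:
    preds = rf.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds, digits=2))
```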

We see quite large differences between the three approaches. Simply attaching a higher weight to the fraud class didn’t help at all: we were only able to identify 4 fraud cases correctly. The SMOTE approach led to far more identified fraud cases, finding 125 out of 136, which corresponds to a recall of 0.92. However, this comes at a cost: an astonishing 1,283 non-fraud cases were misclassified as fraud, leading to a precision of 0.09 for the fraud class. The Deep Learning Augmentation correctly predicted 107 out of the 136 fraud cases (a recall of 0.79) while only misclassifying 58 non-fraud cases, which leads to a precision of 0.65 for the fraud class.

To conclude, I think this blog post was able to show the merits of Deep Learning Augmentation. While increasing the number of correctly identified fraud cases, we were also able to keep a high precision, meaning we only have a few falsely flagged cases. While SMOTE was able to correctly identify a few more fraud cases, it also created a huge number of false positives, which might be costly when it comes to resource allocation.

If you have any questions or want anything added to the package, just ask me.

Lasse

Originally published at https://lschmiddey.github.io on February 19, 2022.
