Comparison — RandomForest with Oversampling vs Augmented Data

Lasse Schmidt · Published in Analytics Vidhya · Jan 26, 2022


In this blog I’d like to show the difference deep tabular augmentation can make when training a Random Forest on a highly imbalanced dataset. In this case, we have a look at credit card fraud, where fraud is far less represented than non-fraud. The dataset is available here.

Let’s have a look at how many more non-fraud cases we have compared to fraud cases.
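A rough sketch of this check in pandas (the creditcard.csv file name is an assumption):

```python
import pandas as pd

# load the Kaggle credit card fraud dataset (file name assumed)
df = pd.read_csv('creditcard.csv')

# "Class" is 1 for fraud, 0 for non-fraud
print(df['Class'].value_counts())
```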

Overall, we have 284,807 rows, of which 283,823 are non-fraud cases. In order to make use of the deep tabular augmentation we need to scale the data and then keep only the cases of the class we are interested in, in this case the rows where “Class” equals 1.
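That preprocessing could look roughly like this (the StandardScaler is my assumption; any comparable scaler works):

```python
from sklearn.preprocessing import StandardScaler

# scale every feature column; 'Class' stays untouched as the target
feature_cols = [c for c in df.columns if c != 'Class']
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# keep only the class we want to augment: the fraud cases
df_fraud = df[df['Class'] == 1]
```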

For our model to work we need to put our data into a DataLoader (here I use the DataBunch class from the deep tabular augmentation package).
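Since the DataBunch essentially wraps PyTorch data loaders, a plain-PyTorch stand-in for this step could look like the following (batch size and split ratio are assumptions):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

# features of the fraud-only subset, as float32 for PyTorch
X = df_fraud[feature_cols].values.astype('float32')
X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

# for an autoencoder, the input doubles as the target
train_ds = TensorDataset(torch.tensor(X_train), torch.tensor(X_train))
valid_ds = TensorDataset(torch.tensor(X_val), torch.tensor(X_val))
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=64)
```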

Now we’re already good to go. We can define our Variational Autoencoder (VAE) architecture (here: 50–12–12–5–12–12–50) and then use the learning rate finder to tell us the best learning rate.
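A minimal plain-PyTorch sketch of such a VAE; I parametrize the input width instead of hardcoding it, while the hidden and latent sizes follow the layer structure above:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Encoder 12-12 down to a 5-dim latent space, mirrored in the decoder."""

    def __init__(self, in_features, latent_dim=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_features, 12), nn.ReLU(),
            nn.Linear(12, 12), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(12, latent_dim)
        self.fc_logvar = nn.Linear(12, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 12), nn.ReLU(),
            nn.Linear(12, 12), nn.ReLU(),
            nn.Linear(12, in_features),
        )

    def reparameterize(self, mu, logvar):
        # sample z = mu + sigma * eps with eps ~ N(0, 1)
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```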

We set up a desirable learning rate and a scheduler for it.
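In plain PyTorch this step might look as follows; the learning rate, epoch count, and one-cycle scheduler are assumptions standing in for the package’s LR finder and scheduler:

```python
epochs = 400  # assumed; the post doesn't state the epoch count
model = VAE(in_features=X_train.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # in practice, use the LR finder's suggestion
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-3, total_steps=epochs * len(train_dl)
)
```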

Now, let’s train the model:
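A generic training loop with the usual VAE objective (reconstruction error plus KL divergence) stands in for the package’s trainer here:

```python
def vae_loss(recon, x, mu, logvar):
    # reconstruction error plus KL divergence against the unit gaussian prior
    recon_loss = nn.functional.mse_loss(recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

for epoch in range(epochs):
    model.train()
    for xb, _ in train_dl:
        recon, mu, logvar = model(xb)
        loss = vae_loss(recon, xb, mu, logvar)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()  # OneCycleLR is stepped once per batch
```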

Let’s see what the created data looks like.
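To create new rows, we draw latent vectors from the prior and push them through the decoder (the sample count of 1,000 is an arbitrary choice):

```python
model.eval()
with torch.no_grad():
    # decode latent samples from the prior into fake fraud rows
    z = torch.randn(1000, 5)
    fake = model.decoder(z).numpy()

df_fake = pd.DataFrame(fake, columns=feature_cols)
df_fake['Class'] = 1
print(df_fake.head())
```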

Train Random Forest

We want to compare how the built-in class_weight functionality performs vs the new approach (spoiler: if you do not use any weights, the Random Forest will always predict 0). Hence, we create three dataframes: the original, the original appended with the fake data, and the original appended with the fake data plus noise.
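The noisy variant can be built by adding a little Gaussian noise on top of the fake data (the noise scale of 0.1 is a guess):

```python
import numpy as np

df_fake_noise = df_fake.copy()
df_fake_noise[feature_cols] += np.random.normal(
    0, 0.1, size=df_fake_noise[feature_cols].shape
)
```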

To make things easier to understand, let’s define the datasets we train on and the test set we assess the results on.
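A sketch of that setup; I split the original data first, so the fake rows only ever end up in the training sets and all models are assessed on the same untouched test set:

```python
from sklearn.model_selection import train_test_split

# hold out a test set from the original data only
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['Class'], random_state=42
)

# the three training sets: original, augmented, augmented with noise
train_aug = pd.concat([train_df, df_fake], ignore_index=True)
train_aug_noise = pd.concat([train_df, df_fake_noise], ignore_index=True)

X_test, y_test = test_df[feature_cols], test_df['Class']
```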

First, let’s train a model on the original data while using the differences in class occurrences as weights.
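In scikit-learn this maps to class_weight='balanced' (the tree count is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 'balanced' reweights classes inversely proportional to their frequency
rf_weighted = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
rf_weighted.fit(train_df[feature_cols], train_df['Class'])
print(classification_report(y_test, rf_weighted.predict(X_test)))
```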

Then, we train on the augmented dataframes.
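Here the added fake fraud rows take over the rebalancing, so no class weights are set:

```python
rf_aug = RandomForestClassifier(n_estimators=100, random_state=42)
rf_aug.fit(train_aug[feature_cols], train_aug['Class'])
print(classification_report(y_test, rf_aug.predict(X_test)))

# same again for the variant with noise on the fake data
rf_aug_noise = RandomForestClassifier(n_estimators=100, random_state=42)
rf_aug_noise.fit(train_aug_noise[feature_cols], train_aug_noise['Class'])
print(classification_report(y_test, rf_aug_noise.predict(X_test)))
```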

Wow, I think that is quite astonishing. We managed to substantially increase the number of fraud cases we are able to detect. Moreover, we achieved these results without any fine-tuning of the model architecture, simply using the default structure of the VAE.

I hope this blog shed some light on why this approach is worth a shot on highly imbalanced data.

Also, if you want to try it for yourself, I made this into a package you can pip install and use right away. Here is the repo.

Lasse

Originally published at https://lschmiddey.github.io on January 24, 2021.
