
Handling Class Imbalance

Often data sets have classes that are not equally represented.  This class imbalance can be a source of frustration when you realize that the accuracy of your first attempt at training a machine learning model simply reflects the percentage of the most common class.  Fortunately, Vaimal has features that help you overcome class imbalance and hopefully improve model performance.  In this article we will discuss techniques that can be used with Vaimal.

Sampling Methods for Class Imbalance

A good starting point is to employ sampling methods that alter the training data set to provide more class balance.  Vaimal has three techniques to accomplish this goal.

Over Sampling

With over-sampling, all of the original data is used, plus duplicate copies of some of it.  We randomly re-sample data points from the under-represented classes and add the copies to balance out the classes.  In the following example we have two classes:

Red: 1000 data points
Green: 100 data points

To over-sample, we randomly pick instances of the green class and make copies.  These copies are added to the original data before training the model.  Over-sampling doesn’t necessarily mean making 900 copies of green data so that each class has 1000 instances.  We may want to increase the green class so it has more instances, but not as many as the red class.  This is accomplished using the balance ratio.  Defining class 1 as the class with the most instances, we can determine the other class counts as follows.

balance ratio = class 1 cases ÷ class n cases

To determine the training cases for classes without the most instances, we rearrange the equation above:

class n cases = class 1 cases ÷ balance ratio

Balance ratio is ≥ 1.  If balance ratio is 1, all classes will have the same number of instances.  A balance ratio greater than 1 will result in the class with the most instances still having more instances than other classes.
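Vaimal handles this sampling for you, but the bookkeeping is simple enough to sketch.  Here is a minimal Python illustration; the function name and the dict-of-lists data layout are assumptions for the example, not Vaimal's interface:

import random

def over_sample(data_by_class, balance_ratio=1.0):
    # data_by_class maps class label -> list of instances (assumed layout)
    max_count = max(len(v) for v in data_by_class.values())  # class 1 cases
    balanced = {}
    for label, instances in data_by_class.items():
        target = round(max_count / balance_ratio)            # class n cases
        target = max(target, len(instances))  # never discard original data
        copies = [random.choice(instances) for _ in range(target - len(instances))]
        balanced[label] = instances + copies
    return balanced

With the red/green example and a balance ratio of 2, the green class is brought up to 1000 ÷ 2 = 500 instances (its original 100 plus 400 random copies), while the red class keeps all 1000.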

Over-sampling is often the preferred method since no information in the training data is being discarded.  This is especially true with small data sets.

Under Sampling

The under-sampling method throws out some data to balance classes.  Under-sampling is more appropriate when there is an ample amount of data, since some of it is discarded.  In our red/green example, we randomly select instances of the red class and discard them.  All of the green class is used, but the red class is under-sampled when the data is used to train the model.  Again, we use a balance ratio to control the amount of under-sampling, but it has a different meaning.  Now we define class 1 as the class with the fewest instances.

balance ratio = class n cases ÷ class 1 cases

To determine the training cases for classes without the fewest instances, we rearrange the equation above:

class n cases = class 1 cases × balance ratio

Balance ratio is ≥ 1.  If balance ratio is 1, all classes will have the same number of instances.  A balance ratio greater than 1 will result in the class with the fewest instances still having fewer instances than other classes.
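A corresponding sketch for under-sampling, with the same assumed data layout:

import random

def under_sample(data_by_class, balance_ratio=1.0):
    min_count = min(len(v) for v in data_by_class.values())  # class 1 cases
    balanced = {}
    for label, instances in data_by_class.items():
        target = round(min_count * balance_ratio)            # class n cases
        target = min(target, len(instances))  # keep at most what we have
        balanced[label] = random.sample(instances, target)   # rest is discarded
    return balanced

With a balance ratio of 5, for example, the red class would be cut to 100 × 5 = 500 instances while all 100 green instances are kept.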

Balance Sampling

Balance sampling is a hybrid of the previous two methods.  Some classes are over-sampled and some are under-sampled.  The average instance count of the classes with the most and fewest instances is found.  In the red/green example, this average would be (1000 + 100)/2 = 550.  The red class is under-sampled so that 550 instances are used for training.  The green class is over-sampled so that 550 instances are used for training.
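Continuing the illustrative sketch from the previous sections, balance sampling might look like this:

import random

def balance_sample(data_by_class):
    counts = [len(v) for v in data_by_class.values()]
    target = (max(counts) + min(counts)) // 2  # e.g. (1000 + 100) // 2 = 550
    balanced = {}
    for label, instances in data_by_class.items():
        if len(instances) >= target:
            # larger classes are under-sampled down to the target
            balanced[label] = random.sample(instances, target)
        else:
            # smaller classes are over-sampled up to the target
            copies = [random.choice(instances) for _ in range(target - len(instances))]
            balanced[label] = instances + copies
    return balanced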

Combining Models – Ensemble Methods

Ensemble methods allow us to combine models into a “meta-model” that can provide better performance than a single model.  This is true for class imbalance as well.  Vaimal has two ensemble methods available.

Bagging Ensembles

Bagging ensembles, also known as bootstrap aggregating, combine component models that are trained on different training data.  For each model, its training data is sampled with replacement from the original data set.  Some data points are represented more than once and some not at all.  This process is repeated for each model, which results in different training data each time.

For classification, the final prediction is made by majority voting of the component models.  As an example, suppose we have ten component models and they predict the following:

2 models predict class 1

7 models predict class 2

1 model predicts class 3

Class 2 will be chosen by the ensemble since the most models predicted it.

For regression, the average of all component model predictions is used to predict output.
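The two pieces of bagging — bootstrap sampling and combining predictions — are easy to sketch in Python.  The predict method below is an assumed per-model interface for illustration, not Vaimal's API:

import random
from collections import Counter

def bootstrap(data):
    # Sample len(data) points with replacement; some points appear more than
    # once and some not at all, so every component model sees different data.
    return [random.choice(data) for _ in data]

def bagging_classify(models, x):
    # Classification: majority vote of the component models' predictions.
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]

def bagging_regress(models, x):
    # Regression: average of the component models' predictions.
    return sum(m.predict(x) for m in models) / len(models)

In the ten-model example above, the vote tally would be class 2: 7, class 1: 2, class 3: 1, so bagging_classify returns class 2.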

Voting Ensembles

Voting ensembles combine component models in the most general way.  The component models can use different algorithms and different class imbalance sampling schemes.  For example, some models can be trained with over-sampling, some with under-sampling, some with just the original data, etc.  Voting ensembles work by majority voting for classification and averaging for regression.
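To make that concrete, here is how training sets for a voting ensemble could be assembled from the over_sample and under_sample sketches above (the toy data stands in for real feature vectors):

# Toy stand-in for real data: class label -> list of instances
data_by_class = {"red": list(range(1000)), "green": list(range(100))}

training_sets = [
    over_sample(data_by_class, balance_ratio=1.0),   # fully balanced via copies
    under_sample(data_by_class, balance_ratio=2.0),  # partially balanced via discards
    data_by_class,                                   # the original, unsampled data
]
# One component model is trained per training set (possibly with a different
# algorithm each); predictions are then combined by majority vote for
# classification or by averaging for regression, exactly as with bagging.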

Use a Different Algorithm

Finally, avoid the temptation to use the same algorithm for every problem.  You may have a favorite, or one that you understand well, but a different algorithm may perform better on imbalanced classes.  Experiment with different techniques to see if they give better performance on the success measures that matter most to you, such as accuracy, sensitivity, and precision.