[SOLVED] How to handle Class Imbalance Problem in Machine Learning?

Hello dear programmers,

The class imbalance problem is very common in machine learning classification datasets: some classes have a very large number of samples/rows, while other classes have very few.

 


Because of this imbalance, the F1 score can differ widely from class to class after training, even while overall accuracy is high. It is easy to be deceived if you look only at accuracy and ignore the precision, recall and F1 score metrics.
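
To make this concrete, here is a minimal sketch in plain Python. The data is made up for illustration (95 samples of label 0, 5 samples of label 1), and the "model" simply predicts the majority class every time. The helper name precision_recall_f1 is my own, not from any library.

```python
# Hypothetical toy data: 95 samples of label 0 and 5 samples of label 1.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100


def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # Report 0.0 when a ratio is undefined (no predictions or no samples of the class).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("Accuracy: {:.2f}".format(accuracy))
for label in (0, 1):
    p, r, f1 = precision_recall_f1(y_true, y_pred, label)
    print("Label {}: precision={:.2f}, recall={:.2f}, F1={:.2f}".format(label, p, r, f1))
```

Accuracy looks great here, yet the model never finds a single sample of label 1, so the F1 score for that class is zero.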

To read more about the precision and recall metrics: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

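A common solution to this problem is SMOTE (Synthetic Minority Oversampling Technique). This algorithm adds synthetic samples to the minority classes, created by interpolating between existing minority samples and their nearest neighbors, so that every class ends up with the same number of samples/rows without discarding any of the original data.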
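
To give a feel for the core idea, here is a rough sketch of how one synthetic sample can be created: pick a minority sample, pick one of its nearest minority neighbors, and take a random point on the line between them. This is a simplified illustration, not the imblearn implementation; the function name smote_like_sample and the toy points are assumptions of mine.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbors."""
    if rng is None:
        rng = random.Random(0)
    x = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to x.
    others = [m for m in minority if m is not x]
    others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
    neighbor = rng.choice(others[:k])
    # lam in [0, 1) decides how far to move from x towards the neighbor.
    lam = rng.random()
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbor))


# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Every synthetic point lies on a segment between two real minority samples, which is why the new rows stay inside the region the minority class already occupies.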

Here is how to apply the algorithm in Python with the imbalanced-learn (imblearn) library:

       

print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# import SMOTE from the imbalanced-learn (imblearn) library
# pip install imbalanced-learn (if you don't have it in your system)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=2)
# fit_resample is the current method name; the older fit_sample was removed in newer releases
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))



       
 

If you have any questions, feel free to ask in the comments.

Happy coding :)
