In most real-world problems (such as fraud detection, spam filtering, disease screening, SaaS subscription churn, advertising click-through prediction, etc.), the problem of imbalanced data occurs. Imbalanced data in real-world scenarios is not only common but expected. Imbalanced data is data with a disproportionate distribution of classes.
For example: suppose you are working on a binary classification problem in which the ratio of Class A to Class B is 100:10, i.e. there are ten times as many samples of Class A as of Class B. Because of this, the model's predictions are biased toward Class A while still reporting high accuracy, and you will feel frustrated when you find that all of the great results you thought you were getting turn out to be a lie. In such cases, machine learning does not produce satisfactory classifiers.
This makes prediction difficult, so the imbalanced data needs to be treated first. There are several approaches to treating imbalanced data and balancing the classes.
You can transform your data into a balanced form by performing artificial sampling. Several such methods are described below:
Undersampling is the process by which observations belonging to the majority class are randomly deleted until the number of majority-class observations is nearly equal to the number of minority-class observations. Balanced data are good for classification, but you obviously lose information about appearance frequencies with this method, which affects both the accuracy metrics themselves and model performance.
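As a rough sketch of random undersampling using scikit-learn's `resample` utility (the dataset below is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Toy imbalanced dataset: 100 majority (class 0) vs 10 minority (class 1) samples
X = rng.normal(size=(110, 2))
y = np.array([0] * 100 + [1] * 10)

X_maj, X_min = X[y == 0], X[y == 1]

# Randomly delete majority-class observations (sampling WITHOUT replacement)
# until the majority class matches the minority class in size
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.array([0] * len(X_maj_down) + [1] * len(X_min))
```

After this step both classes contribute 10 observations each, at the cost of throwing away 90% of the majority-class data.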
Oversampling is the process by which new minority-class observations are generated by randomly duplicating/replicating existing minority-class observations. In this case there is no information loss, but duplicating the minority data to grow the dataset increases the likelihood of overfitting.
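Random oversampling is the mirror image of the previous sketch: instead of deleting majority samples, the minority class is resampled with replacement until the class counts match (again on a synthetic toy dataset):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Same toy setup: 100 majority (class 0) vs 10 minority (class 1) samples
X = rng.normal(size=(110, 2))
y = np.array([0] * 100 + [1] * 10)

X_maj, X_min = X[y == 0], X[y == 1]

# Duplicate minority-class observations (sampling WITH replacement)
# until the minority class matches the majority class in size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
```

No data is discarded, but the 100 minority rows are exact copies drawn from only 10 distinct originals, which is exactly where the overfitting risk comes from.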
SMOTE (Synthetic Minority Over-sampling Technique) introduces new synthetic samples rather than creating new observations by replication. Although SMOTE is better than simple oversampling, it still loses its effectiveness when you have high-dimensional data.
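The core SMOTE idea can be sketched in a few lines: for each synthetic point, pick a minority sample, pick one of its k nearest minority neighbours, and interpolate between the two. This is a hand-rolled illustration, not a production implementation (the imbalanced-learn library provides a full `SMOTE` class); the function name and parameters are my own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours, because each point's nearest neighbour is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a random minority sample
        j = idx[i][rng.integers(1, k + 1)]   # pick one of its k real neighbours
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))              # 10 minority samples
X_new = smote_sketch(X_min, n_new=90)         # synthesize 90 more to reach 100
```

Because the new points lie on line segments between real minority samples, they are plausible but not genuine observations; in high dimensions these interpolations become less representative, which is the weakness mentioned above.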
Through extensive experimentation, it has been found that these algorithms are good enough to balance the data, but they have several drawbacks that make them ineffective in some cases. So it is advisable to use a hybrid of techniques, or to modify the basic sampling approaches, rather than relying on any single technique.
We do have a number of approaches that can help us balance our data. But the question is: do we really need to balance the data? After all, when we are dealing with an imbalanced problem, it is natural for the data to be imbalanced. Although balancing the data may increase accuracy, does it really produce an effective and robust classifier?
As we can see from the different techniques, they either remove observations (which leads to loss of data) to balance the classes, or replicate or synthetically increase the data (which generates observations that are not genuine). It is really important to understand a few facts before balancing the data.
Whether to balance the data or not depends entirely on the type of use case you are working on, the reasons behind the imbalance, and the type of ML algorithms you are using for training.
- If you want to build a predictive model using basic machine learning models such as Decision Trees, SVMs, Naive Bayes, etc., and using area under the ROC curve (AUC), rank-based metrics, etc. as performance measures, then balanced data is a good choice.
- If you want to build a predictive model using an advanced machine learning framework, i.e. something that determines sampling parameters via a wrapper or a modification of a bagging framework that samples to class equivalence, then a representative sample is a good choice, since such a framework takes care of balancing the training data on its own.
- If you want to build a representative model -- a representative model can be an unsupervised model that describes the data rather than necessarily predicting -- then a representative sample of the data is a good choice.
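The first bullet's point about metrics can be demonstrated concretely: on an imbalanced problem, plain accuracy looks good even for a useless model, while AUC tells a more honest story. The dataset below is synthetic (a 95:5 split generated with scikit-learn), so the exact scores are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy 95:5 imbalanced problem to show why accuracy alone is misleading
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# A "predict the majority class for everything" baseline already scores
# around 0.95 accuracy here, without learning anything at all
baseline = accuracy_score(y_te, np.zeros_like(y_te))
print(f"accuracy={acc:.3f}  ROC AUC={auc:.3f}  majority baseline={baseline:.3f}")
```

The majority-class baseline cannot look good under AUC (it ranks nothing), which is why rank-based metrics are the safer yardstick when classes are skewed.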
When we have unbalanced data, it is not always necessary to balance the sample. There are several other things that should be considered before balancing the data. It is really important to analyse the type of use case, model, and data you are using in the experiment.