Synthetic case study on the state-of-the-art samplers for imbalanced learning

This repository provides the code needed to build a graphical analysis of state-of-the-art sampling methods for imbalanced classification. It constructs a binary synthetic data set with a chess board distribution, where the number of instances and the imbalance ratio are set through the program's parameters. The data set is then preprocessed with the most relevant methods available as Python packages and on CRAN for R. The results are plotted together with the classification surfaces inferred by scikit-learn's decision tree.

The repository contains the following files:

plot_synthetic.py generates the synthetic data and runs all the sampling methods of the imblearn package. Its parameters are as follows:
- 1st parameter (div): shape of the chess board.
- 2nd parameter (N): number of instances for the balanced dataset (N/2 for each class).
- 3rd parameter (per): fraction of instances that make up the imbalanced data set (value in [0,1]).
plot_syntheticMWMOTE.py does the same as plot_synthetic.py, but with the MWMOTE method.
MWMOTE.py implements the MWMOTE method provided in its GitHub repo.
smotesData.R runs other important over-sampling methods implemented in the smotefamily and ROSE packages of R.
plot_file.py plots the results obtained with smotesData.R, giving the generated files as parameter.
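
As an illustration of the kind of data set described above, here is a minimal sketch that samples points in the unit square and labels them by chess board colour. It is not the code in plot_synthetic.py. The function name `make_chessboard` and the parameter `cells` (number of squares per side) are hypothetical, and the sketch assumes that `per` is the fraction of the minority class kept from its balanced size of N/2; the script may interpret `div` and `per` differently.

```python
import random


def make_chessboard(cells, n, per, seed=0):
    """Return (X, y): 2-D points in the unit square labelled by square colour.

    Assumption (not taken from the repository): class 1 is the minority and is
    reduced to a fraction `per` of its balanced size n // 2.
    """
    if cells < 2:
        raise ValueError("cells must be >= 2 so that both colours exist")
    rng = random.Random(seed)
    half = n // 2
    targets = {0: half, 1: max(1, round(half * per))}
    counts = {0: 0, 1: 0}
    X, y = [], []
    # Rejection sampling: draw uniform points until each class is full.
    while counts[0] < targets[0] or counts[1] < targets[1]:
        px, py = rng.random(), rng.random()
        # Colour of the square containing the point: parity of column + row.
        label = (int(px * cells) + int(py * cells)) % 2
        if counts[label] < targets[label]:
            X.append((px, py))
            y.append(label)
            counts[label] += 1
    return X, y


X, y = make_chessboard(cells=4, n=1000, per=0.1)
```

Under these assumptions the call returns 500 majority points and 50 minority points spread over a 4x4 board.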

Included Methods and some examples

Starting from a 4x4 chess board data set with 1000 instances and 10% of the minority class (div=5; N=1000; per=0.1):

Over-sampling methods

ADASYN (imblearn package, default parameters)
BLSMOTE (smotefamily R package, default parameters)
DBSMOTE (smotefamily R package, default parameters)
MWMOTE (MWMOTE GitHub repo, #Synthetic(N)=400)
ROSE (ROSE R package, hmult.majo=0.1, hmult.mino=0.1)
RSLS (smotefamily R package, default parameters)
SLS (smotefamily R package, default parameters)
SMOTE (imblearn package, default parameters)
SMOTEENN (imblearn package, default parameters)
SMOTETomek (imblearn package, default parameters)
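
Several of the methods listed above build on the SMOTE idea of creating a synthetic minority instance on the segment between a minority instance and one of its nearest minority-class neighbours. The following is a minimal pure-Python sketch of that interpolation step only. It is not the imblearn or smotefamily implementation, and the function name `smote_like` and its parameters are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    `minority` is a list of at least two coordinate tuples from the minority class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


minority = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.15, 0.15), (0.12, 0.18)]
new_points = smote_like(minority, n_new=10, k=3)
```

Every synthetic point lies between two existing minority points, which is why this family of methods tends to fill in the minority squares of the chess board rather than duplicate instances.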

Under-sampling methods

IHT (imblearn package, default parameters)
NCL (imblearn package, n_neighbors=20)
OSS (imblearn package, k=1, n_seeds_S=100)
