Synthetic case study on the state-of-the-art samplers for imbalanced learning

This repository provides the code needed to build a graphical analysis of the best sampling methods for imbalanced classification. With it, a binary synthetic data set with a chess board distribution can be constructed with the desired number of instances and imbalance ratio by tuning the parameters of the program. This data set is preprocessed by the most relevant methods available in Python and on R's CRAN. The results are plotted together with the classification surfaces inferred by Scikit-Learn's decision tree.

The repository contains the following files:

  • plot_synthetic.py generates the synthetic data and executes all the sampling methods of the imblearn package. Its parameters are as follows:

    • 1st parameter (div): shape of the chess board.
    • 2nd parameter (N): number of instances for the balanced dataset (N/2 for each class).
    • 3rd parameter (per): proportion of instances that make up the imbalanced data set, given as a fraction in [0,1] (for example, per=0.1 for 10% of the minority class).
  • plot_syntheticMWMOTE.py does the same as plot_synthetic.py, but with the MWMOTE method.

  • MWMOTE.py implements the MWMOTE method provided in its GitHub repo.

  • smotesData.R executes other important over-sampling methods implemented in the smotefamily and ROSE packages of R.

  • plot_file.py plots the results obtained with smotesData.R, taking the generated files as parameters.
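To make the chess board distribution concrete, here is a minimal sketch of how such a balanced data set could be generated. It is not the code in plot_synthetic.py: the function name `make_chessboard` and its `cells` argument are hypothetical, and the sketch takes the number of cells per side directly because the exact mapping from div to the board shape is not spelled out here. It uses only the Python standard library and assumes the class of a point is the colour of the cell it falls in.

```python
import random


def make_chessboard(cells, n, seed=0):
    """Return n points in [0, 1)^2 labelled by chess-board cell colour.

    Hypothetical helper (an assumption, not the repository's function):
    n // 2 points are drawn for each class by rejection sampling.
    """
    rng = random.Random(seed)
    points, labels = [], []
    counts = {0: 0, 1: 0}
    per_class = n // 2
    while counts[0] < per_class or counts[1] < per_class:
        x, y = rng.random(), rng.random()
        # Cell colour: parity of the column index plus the row index.
        label = (int(x * cells) + int(y * cells)) % 2
        # Keep the point only if its class still needs instances.
        if counts[label] < per_class:
            points.append((x, y))
            labels.append(label)
            counts[label] += 1
    return points, labels


# Example: a 4x4 board with 1000 instances (500 per class).
X, y = make_chessboard(cells=4, n=1000)
```

Rejection sampling keeps the sketch short and guarantees exactly N/2 instances per class, at the cost of discarding some draws once one class is full.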

Included Methods and some examples

Starting from a 4x4 chess data set with 1000 instances and 10% of the minority class (div=5; N=1000; per=0.1):

[Figures: DSoriginal, DSimbalanced]
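The step from the balanced data set to the imbalanced one can be sketched as follows. This is an illustration under a stated assumption, not the repository's implementation: it reads per as the fraction of minority-class instances that are kept while the majority class is left untouched. The function name `make_imbalanced` and the choice of class 1 as the minority are hypothetical.

```python
import random


def make_imbalanced(points, labels, per, minority=1, seed=0):
    """Keep every majority instance and a random fraction `per` of the minority.

    Assumption: `per` is the share of minority-class instances retained.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    n_keep = round(per * len(minority_idx))
    kept = set(rng.sample(minority_idx, n_keep))
    # Preserve the original order of the surviving instances.
    keep = [i for i, lab in enumerate(labels) if lab != minority or i in kept]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy balanced input: 1000 random points with alternating labels.
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(1000)]
labels = [i % 2 for i in range(1000)]

X_imb, y_imb = make_imbalanced(points, labels, per=0.1)
```

Under this reading, N=1000 and per=0.1 leave 500 majority and 50 minority instances. If per is instead meant as the minority share of the final data set, the number of kept minority instances would need to be computed from the majority count.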

Over-sampling methods

Under-sampling methods