Two density-based sampling approaches for imbalanced and overlapping data

Authors: Hamid Saadatfar
Journal: Knowledge-Based Systems
Page number: 108217-108217
Serial number: 241
Volume number: 1
Paper Type: Full Paper
Published At: 2022
Journal Grade: ISI
Journal Type: Electronic
Journal Country: Iran, Islamic Republic Of
Journal Index: JCR, Scopus

Abstract

An imbalanced dataset consists of a majority class and a minority class, where the former’s sample size is substantially larger than the latter’s. This difference disrupts the learning process and biases learning algorithms toward modeling the majority class. Data overlap can exacerbate the already complicated problem of imbalanced datasets, for which oversampling and undersampling approaches are commonly adopted. This paper proposes two novel density-based algorithms that eliminate noise and the overlap between the two classes while balancing and normalizing the class distribution. The first algorithm employs an undersampling technique, whereas the second uses undersampling and oversampling simultaneously. Both algorithms delete high-density samples from the majority class and eliminate noise in both classes. The two proposed algorithms and other popular algorithms were run on 16 imbalanced datasets covering a variety of scenarios. The datasets balanced by these algorithms were then modeled with Random Forest and SVM classifiers. The models obtained from the two proposed algorithms outperformed those of the other algorithms on all criteria. They also achieved balance while preserving the structure and form of the classes as much as possible, which protects the quality of learning.
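
The abstract does not spell out the algorithms themselves, so the following is only a rough illustration of the general idea of density-based undersampling, not the authors' method. It assumes a simple density proxy (the inverse of the mean distance to the k nearest majority-class neighbours) and removes the densest majority samples until the classes are the same size. The function names `knn_density` and `density_undersample`, the parameter `k`, and the stopping rule are all assumptions made for this sketch; the noise and overlap removal described in the abstract is not covered.

```python
import numpy as np


def knn_density(points, k=5):
    """Density proxy (an assumption of this sketch): inverse of the mean
    distance from each point to its k nearest neighbours."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = max(1, min(k, len(points) - 1))
    nearest = np.sort(dists, axis=1)[:, :k]
    return 1.0 / (nearest.mean(axis=1) + 1e-12)


def density_undersample(X, y, majority_label, k=5):
    """Drop the highest-density majority samples until both classes
    have the same number of samples (hypothetical stopping rule)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    if len(maj_idx) <= len(min_idx):
        return X, y  # already balanced, nothing to remove

    density = knn_density(X[maj_idx], k=k)
    # Sort majority samples from sparsest to densest and keep the sparsest.
    order = np.argsort(density, kind="stable")
    keep_maj = maj_idx[order[:len(min_idx)]]
    keep = np.sort(np.concatenate([keep_maj, min_idx]))
    return X[keep], y[keep]
```

Calling `density_undersample(X, y, majority_label=0)` on a feature matrix `X` and label vector `y` returns a balanced subset in which the majority samples from crowded regions have been thinned out first. A pairwise distance matrix is used for brevity and grows quadratically with the number of majority samples, so a nearest-neighbour index would be a more practical choice for larger datasets.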

Paper URL

tags: Imbalanced dataset; Density; Undersampling; Oversampling; Overlapping