Two density-based sampling approaches for imbalanced and overlapping data

Authors: Hamid Saadatfar
Journal: Knowledge-Based Systems
Page numbers: 108217-108217
Serial number: 241
Volume number: 1
Paper type: Full Paper
Publication date: 2022
Journal rank: ISI
Journal type: Electronic
Country of publication: Iran
Journal indexing: JCR, Scopus

Paper abstract

Abstract - An imbalanced dataset consists of a majority class and a minority class, where the former's sample size is substantially larger than the latter's. This difference disrupts the learning process and biases learning algorithms toward modeling the majority class. Overlap between classes can exacerbate the already complicated problem of imbalanced datasets, a problem that is commonly addressed with oversampling and undersampling approaches. This paper proposes two novel density-based algorithms that eliminate the overlap between the two classes and the noise, balance the classes, and normalize the class distribution. The first algorithm employs an undersampling technique, whereas the second uses undersampling and oversampling techniques simultaneously. Both algorithms delete high-density samples from the majority class and eliminate noise in both classes. The two proposed algorithms and other popular algorithms were run on 16 imbalanced datasets covering a variety of scenarios. The datasets balanced by these algorithms were then modeled with Random Forest and SVM classifiers. The models obtained from the two proposed algorithms outperformed those of the other algorithms on all criteria. The proposed algorithms also achieve balance while preserving the structure and shape of the classes as far as possible, which protects the quality of learning.
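
The abstract does not specify how density is estimated or how many samples are removed, so the following is only a minimal sketch of the general idea of deleting high-density majority samples, not the paper's algorithm. It assumes a k-nearest-neighbour density proxy (the inverse of the mean distance to the k nearest neighbours) and assumes removal continues until both classes have equal size. The names `knn_density` and `density_undersample` and the parameter `k` are hypothetical. Noise elimination and the oversampling step of the second algorithm are not covered.

```python
import math
import random


def knn_density(points, k=5):
    """Density proxy (an assumption): inverse mean distance to the k nearest neighbours."""
    densities = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        # The small constant avoids division by zero for duplicated points.
        densities.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))
    return densities


def density_undersample(X, y, majority_label, k=5):
    """Drop the densest majority samples until both classes have equal size."""
    maj = [i for i, label in enumerate(y) if label == majority_label]
    mino = [i for i, label in enumerate(y) if label != majority_label]
    n_keep = max(len(mino), 1)
    if len(maj) <= n_keep:
        return list(X), list(y)  # already balanced, nothing to remove
    density = knn_density([X[i] for i in maj], k)
    # Rank majority samples from sparsest to densest and keep the sparsest ones.
    ranked = sorted(range(len(maj)), key=lambda r: density[r])
    keep = sorted([maj[r] for r in ranked[:n_keep]] + mino)
    return [X[i] for i in keep], [y[i] for i in keep]


# Synthetic two-class data: 90 majority points and 10 minority points.
random.seed(0)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(90)]
X += [(random.gauss(3, 1), random.gauss(3, 1)) for _ in range(10)]
y = [0] * 90 + [1] * 10

X_bal, y_bal = density_undersample(X, y, majority_label=0)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Removing the densest samples first thins out the crowded core of the majority class while leaving its sparse regions intact, which is one way to reduce the class size without distorting its overall shape. The balanced output could then be passed to a classifier such as Random Forest or SVM, as in the evaluation the abstract describes.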

Permanent link to the paper

tags: Imbalanced dataset; Density; Undersampling; Oversampling; Overlapping