Enhanced instance selection for large-scale data using integrated clustering and autoencoder techniques

Hamid Saadatfar, Mohammad Nazari

Authors	Hamid Saadatfar, Mohammad Nazari
Journal	International Journal of Data Science and Analytics
Pages	5585-5602
Serial number	20
Volume number	6
Article type	Full Paper
Publication date	2025
Journal type	Electronic
Country of publication	Iran
Journal index	Scopus
Keywords	Instance selection; Large data; Clustering; Autoencoder; Classification.

Abstract

Instance selection plays a crucial role in improving the efficiency of machine learning models, especially when dealing with large datasets. Traditional instance selection methods often struggle to balance data reduction with preserving essential information, particularly in high-dimensional and complex datasets. This paper introduces a novel approach, instance selection by combining clustering and autoencoders (CAIR), designed specifically for large-scale data. CAIR addresses key gaps in the literature by integrating clustering techniques to group similar data points and using autoencoders to reduce dimensionality while retaining critical boundary instances. Unlike conventional methods that focus primarily on either boundary or inner instances, CAIR balances the removal of redundant data with the preservation of instances crucial for classification. Experimental results on 24 large datasets from the KEEL repository demonstrate that CAIR achieves superior data reduction while maintaining high classification accuracy compared to state-of-the-art methods, including k-nearest neighbor (KNN), edited nearest neighbors (ENN), DROP3, ATISA1, and RIS. CAIR fills a significant gap by providing an effective solution for large-scale data reduction without compromising performance.
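The abstract does not spell out the CAIR procedure, so the following is only an illustrative sketch of the general idea of cluster-based instance selection, not the authors' method. It assumes a simple rule: clusters whose members all share one class are treated as redundant "inner" regions and reduced to the single instance nearest the cluster mean, while clusters with mixed classes are treated as boundary regions and kept whole. The autoencoder step is omitted entirely, the function names `kmeans_assign` and `select_instances` are hypothetical, and the centroid initialization at evenly spaced indices is a simplification chosen to keep the run deterministic.

```python
import math
from collections import defaultdict


def kmeans_assign(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point.

    Assumption: initial centroids are taken at evenly spaced indices
    so that the result is deterministic (not a recommended init).
    """
    n = len(points)
    centroids = [points[i * n // k] for i in range(k)]
    assign = [0] * n
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assign


def select_instances(points, labels, k):
    """Return sorted indices of the instances kept after reduction."""
    clusters = defaultdict(list)
    for idx, c in enumerate(kmeans_assign(points, k)):
        clusters[c].append(idx)

    kept = []
    for members in clusters.values():
        if len({labels[i] for i in members}) > 1:
            # Mixed classes: assumed to be a boundary region, keep all.
            kept.extend(members)
        else:
            # Single class: keep only the instance nearest the mean.
            center = tuple(
                sum(col) / len(members)
                for col in zip(*(points[i] for i in members))
            )
            kept.append(
                min(members, key=lambda i: math.dist(points[i], center))
            )
    return sorted(kept)
```

For example, with two well-separated groups of five points each, one class per group, and `k=2`, the sketch keeps one representative per group; if one point in a group carries the other class label, that whole group is retained.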


Associate Professor Hamid Saadatfar
