Temporal analysis of topic modeling output by machine learning techniques

Authors: Hamed Vahdat-Nejad, Faezeh Azizi, Hamideh Hajiabadi
Journal: International Journal of Data Science and Analytics
Pages: 2175-2225
Serial number: 20
Volume number: 3
Article type: Full Paper
Publication date: 2025
Journal type: Print
Country of publication: Iran
Journal indexing: Scopus
Keywords: Topic modeling, Natural language processing, Clustering, Social network, COVID-19

Abstract

Topic modeling is widely recognized as one of the most effective methods of unsupervised text analysis. It identifies and extracts topics from document sets associated with various entities (e.g., countries, websites, or journals). However, its output lacks high-level information per entity, and applying machine learning methods to topic modeling outputs is generally challenging. Some studies have applied machine learning methods statically, ignoring the effect of time on topic modeling outputs; including time adds further complexity. This study introduces a novel approach to clustering the output of topic modeling per entity while taking time into account. For this purpose, it proposes two constructs: topic popularity over time and a feature vector for each entity over time. Because the proposed feature vector is high-dimensional, selecting an appropriate dimensionality reduction technique and a matching clustering algorithm is not straightforward, so this research also proposes a new approach to making that selection. A case study on COVID-19-related tweets evaluates the proposed method's performance. The proposed approach applies t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and Fuzzy C-Means (FCM) for clustering. Unlike previous research, this study incorporates the time factor, and it also outperforms those studies on the Davies-Bouldin Index (DBI), Silhouette Coefficient (SC), Calinski-Harabasz Index (CHI), and Dunn Index (DI). The proposed method enables researchers in natural language processing to analyze topic dynamics across various entities, leading to improved research outcomes.
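
The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption: "topic popularity" is taken to be the share of an entity's documents assigned to each topic within a time window, the per-entity feature vector is the concatenation of those shares across windows, and `temporal_feature_vectors`, `fuzzy_c_means`, and `make_records` are hypothetical helper names. The t-SNE step is left out so the example runs on the standard library alone; in practice the vectors would be reduced (for instance with a library t-SNE implementation) before clustering. The Fuzzy C-Means routine is the textbook update rule, not the authors' code.

```python
import random
from collections import Counter


def temporal_feature_vectors(records, n_topics, n_windows):
    """records: list of (entity, window_index, topic_index), one per document.

    Assumed definition: for each entity and window, the share of documents
    per topic; the shares of all windows are concatenated into one vector.
    """
    counts = Counter(records)
    totals = Counter((entity, window) for entity, window, _ in records)
    vectors = {}
    for entity in sorted({r[0] for r in records}):
        vec = []
        for w in range(n_windows):
            total = totals[(entity, w)]
            vec.extend(
                counts[(entity, w, t)] / total if total else 0.0
                for t in range(n_topics)
            )
        vectors[entity] = vec
    return vectors


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Textbook Fuzzy C-Means; returns (membership matrix, cluster centers)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Centers: means of the points weighted by membership ** m.
        centers = []
        for c in range(n_clusters):
            weights = [u[i][c] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append(
                [sum(weights[i] * points[i][d] for i in range(n)) / wsum
                 for d in range(dim)]
            )
        # Memberships: inversely related to the distance to each center.
        for i in range(n):
            dists = [
                max(sum((points[i][d] - centers[c][d]) ** 2
                        for d in range(dim)) ** 0.5, 1e-12)
                for c in range(n_clusters)
            ]
            for c in range(n_clusters):
                u[i][c] = 1.0 / sum(
                    (dists[c] / dists[k]) ** (2.0 / (m - 1.0))
                    for k in range(n_clusters)
                )
    return u, centers


def make_records(entity, topics_by_window):
    """Hypothetical helper: expand per-window topic lists into records."""
    return [(entity, w, t)
            for w, topics in enumerate(topics_by_window)
            for t in topics]


# Toy data (invented for illustration): A and B shift from topic 0 to
# topic 1 over two windows, while C and D shift the other way.
records = (
    make_records("A", [[0, 0, 0, 1], [1, 1, 1, 0]])
    + make_records("B", [[0, 0, 1], [1, 1, 0]])
    + make_records("C", [[1, 1, 1, 0], [0, 0, 0, 1]])
    + make_records("D", [[1, 1, 0], [0, 0, 1]])
)
vectors = temporal_feature_vectors(records, n_topics=2, n_windows=2)
names = list(vectors)
memberships, centers = fuzzy_c_means([vectors[n] for n in names], n_clusters=2)
# Hard label per entity: the cluster with the highest membership.
labels = {n: max(range(2), key=lambda c: row[c])
          for n, row in zip(names, memberships)}
```

Entities with the same temporal trend should end up in the same cluster even though their overall topic mix is identical, which is the kind of distinction a static, time-agnostic feature vector would miss.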
