Temporal analysis of topic modeling output by machine learning techniques

Hamed Vahdat-Nejad, Faezeh Azizi, Hamideh Hajiabadi

Authors	Hamed Vahdat-Nejad, Faezeh Azizi, Hamideh Hajiabadi
Journal	International Journal of Data Science and Analytics
Page numbers	2175-2225
Serial number	20
Volume number	3
Article type	Full Paper
Publication date	2025
Journal type	Print
Country of publication	Iran
Journal index	Scopus
Keywords	Topic modeling, Natural language processing, Clustering, Social network, COVID-19

Abstract

Topic modeling is widely recognized as one of the most effective and significant methods of unsupervised text analysis. It facilitates identifying and extracting topics in document sets associated with various entities (e.g., countries, websites, or journals). Nonetheless, its output lacks high-level information per entity, and applying machine learning methods to topic modeling outputs is generally challenging. Some studies have already applied machine learning methods statically, ignoring the effect of time on topic modeling outputs. Including time adds further complexity to the problem. This study introduces a novel approach to clustering the output of topic modeling per entity while taking the time factor into account. For this purpose, it proposes topic popularity over time and a feature vector for each entity over time. Because the proposed feature vector is high-dimensional, selecting an appropriate dimensionality reduction technique and a corresponding clustering algorithm is not straightforward. This research therefore also proposes a new approach to selecting a dimensionality reduction method and its corresponding clustering technique. A case study is conducted on COVID-19-related tweets to evaluate the proposed method's performance. The proposed approach applies t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and Fuzzy C-Means (FCM) for clustering. Unlike previous research, our study incorporates the time factor, and it also outperforms those earlier approaches in terms of the Davies-Bouldin Index (DBI), Silhouette Coefficient (SC), Calinski-Harabasz Index (CHI), and Dunn Index (DI). The proposed method enables researchers in natural language processing to analyze topic dynamics across various entities, leading to improved research outcomes.


Associate Professor Hamed Vahdat-Nejad
