Temporal analysis of topic modeling output by machine learning techniques

Hamed Vahdat-Nejad, Faezeh Azizi, Hamideh Hajiabadi

Authors	Hamed Vahdat-Nejad, Faezeh Azizi, Hamideh Hajiabadi
Journal	International Journal of Data Science and Analytics
Page number	2175-2225
Serial number	20
Volume number	3
Paper Type	Full Paper
Published At	2025
Journal Type	Typographic
Journal Country	Iran, Islamic Republic Of
Journal Index	Scopus
Keywords	Topic modeling, Natural language processing, Clustering, Social network, COVID-19

Abstract

Topic modeling is widely recognized as one of the most effective methods of unsupervised text analysis. It identifies and extracts topics from document sets associated with various entities (e.g., countries, websites, or journals). However, its output does not provide high-level information per entity, and applying machine learning methods to topic modeling output is generally challenging. Some studies have applied machine learning methods statically, ignoring the effect of time on topic modeling output; including time makes the problem more complex. This study introduces a novel approach to clustering the output of topic modeling per entity while accounting for time. For this purpose, it proposes two constructs: topic popularity over time and a feature vector for each entity over time. Because the proposed feature vector is high-dimensional, selecting an appropriate dimensionality reduction technique and a corresponding clustering algorithm is not straightforward. This research therefore also proposes a new approach to selecting a dimensionality reduction method and its corresponding clustering technique. A case study on COVID-19-related tweets evaluates the proposed method's performance. The proposed approach applies t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and Fuzzy C-Means (FCM) for clustering. Unlike previous research, our study incorporates the time factor, and it also outperforms previous work in terms of the Davies-Bouldin Index (DBI), Silhouette Coefficient (SC), Calinski-Harabasz Index (CHI), and Dunn Index (DI). The proposed method enables researchers in natural language processing to analyze topic dynamics across various entities, leading to improved research outcomes.
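
The abstract does not define topic popularity or the per-entity feature vector, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that topic popularity is the share of an entity's documents in a time window whose dominant topic is a given topic, and that the feature vector concatenates these shares over all windows. The function names, the toy data, and the parameter values are hypothetical. Fuzzy C-Means is written out in NumPy using the standard update rules, and the t-SNE reduction step is omitted, so clustering here runs directly on the raw feature vectors.

```python
import numpy as np

def temporal_feature_vectors(entity_ids, time_ids, topic_ids,
                             n_entities, n_windows, n_topics):
    """One vector per entity: per-window topic shares, concatenated over time.

    Assumption: "topic popularity" is the fraction of an entity's documents
    in a window whose dominant topic is k.
    """
    counts = np.zeros((n_entities, n_windows, n_topics))
    # Count documents per (entity, window, dominant topic).
    np.add.at(counts, (entity_ids, time_ids, topic_ids), 1.0)
    totals = counts.sum(axis=2, keepdims=True)
    # Normalise counts to shares; windows without documents stay all-zero.
    shares = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return shares.reshape(n_entities, n_windows * n_topics)

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means with fuzzifier m; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Point-to-center distances, floored to avoid division by zero.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Hypothetical toy corpus: rows are (entity, time window, dominant topic).
# Entities 0 and 1 move from topic 0 to topic 1; entities 2 and 3 stay on topic 0.
docs = np.array([
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1),
    (1, 0, 0), (1, 1, 1),
    (2, 0, 0), (2, 1, 0),
    (3, 0, 0), (3, 0, 0), (3, 1, 0),
])
X = temporal_feature_vectors(docs[:, 0], docs[:, 1], docs[:, 2],
                             n_entities=4, n_windows=2, n_topics=2)
centers, U = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)  # hard assignment taken from the fuzzy memberships
```

In this sketch, entities whose topic shares evolve in the same way over time end up with the highest membership in the same cluster. For realistic data the feature vector grows with the number of windows and topics, which is where a reduction step such as t-SNE (available, for example, as `sklearn.manifold.TSNE`) would precede clustering.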
