| Authors | Hamed Vahdat-Nejad,Faezeh Azizi,Hamideh Hajiabadi |
| Journal | International Journal of Data Science and Analytics |
| Page number | 2175-2225 |
| Serial number | 20 |
| Volume number | 3 |
| Paper Type | Full Paper |
| Published At | 2025 |
| Journal Type | Typographic |
| Journal Country | Iran, Islamic Republic Of |
| Journal Index | Scopus |
| Keywords | Topic modeling, Natural language processing, Clustering, Social network, COVID-19 |
|---|
Abstract
Topic modeling is widely recognized as one of the most effective and significant methods of unsupervised text analysis. This method facilitates identifying and extracting topics in document sets associated with various entities (e.g., countries, websites, or journals). Nonetheless, the method's output lacks high-level information per entity. Applying machine learning methods to topic modeling outputs is generally challenging. Some studies have already applied machine learning methods statically, ignoring the effect of time on topic modeling outputs. The inclusion of time introduces additional complexity to the problem. This study introduces a novel approach to clustering the output of topic modeling per entity, considering the time factor. Topic popularity over time and the feature vector for each entity over time are proposed for this purpose. Due to the high dimensionality of the proposed feature vector, selecting an appropriate dimensionality reduction technique and the corresponding clustering algorithm may not be a straightforward task. This research proposes a new approach to selecting a dimensionality reduction method and its corresponding clustering technique. A case study is conducted on COVID-19-related tweets to evaluate the proposed method's performance. The proposed approach applies t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and Fuzzy C-Means (FCM) for clustering. Unlike previous research, our study incorporates the time factor, and it also outperforms prior methods in terms of the Davies-Bouldin Index (DBI), Silhouette Coefficient (SC), Calinski-Harabasz Index (CHI), and Dunn Index (DI). The proposed method enables researchers in natural language processing to analyze topic dynamics across various entities, leading to improved research outcomes.
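The idea of a per-entity feature vector built from topic popularity over time, followed by fuzzy clustering, can be illustrated with a minimal dependency-free sketch. It is not the authors' implementation. It assumes that "popularity" means the share of an entity's documents in a time window that are assigned to a topic (the abstract does not define it), it omits the t-SNE reduction step, and it uses a simplistic initialization for Fuzzy C-Means. The names `popularity_vector`, `fuzzy_c_means`, and the toy `entities` data are hypothetical.

```python
from collections import Counter


def popularity_vector(records, n_topics, n_windows):
    """Flatten per-window topic shares into one feature vector for an entity.

    records: iterable of (window_index, topic_index) pairs, one per document.
    Assumption: popularity = share of a window's documents on a topic.
    """
    counts = Counter(records)
    vector = []
    for w in range(n_windows):
        total = sum(counts[(w, t)] for t in range(n_topics))
        for t in range(n_topics):
            vector.append(counts[(w, t)] / total if total else 0.0)
    return vector


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50):
    """Plain Fuzzy C-Means with fuzzifier m; returns (centers, memberships).

    Simplification: centers start at the first n_clusters points.
    """
    dim = len(points[0])
    centers = [list(p) for p in points[:n_clusters]]
    memberships = []
    for _ in range(n_iter):
        # Membership of point i in cluster k is proportional to d_ik^(-2/(m-1)).
        memberships = []
        for p in points:
            dists = [
                max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12)
                for c in centers
            ]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Each center is the mean of all points weighted by membership^m.
        centers = []
        for k in range(n_clusters):
            weights = [row[k] ** m for row in memberships]
            wsum = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])
    return centers, memberships


# Hypothetical toy data: four entities, 2 topics, 2 time windows.
entities = {
    "A": [(0, 0), (0, 0), (1, 0), (1, 0)],  # only topic 0
    "B": [(0, 0), (0, 0), (1, 0), (1, 1)],  # mostly topic 0
    "C": [(0, 1), (0, 1), (1, 1), (1, 1)],  # only topic 1
    "D": [(0, 1), (0, 0), (1, 1), (1, 1)],  # mostly topic 1
}
names = sorted(entities)
features = [popularity_vector(entities[n], n_topics=2, n_windows=2) for n in names]
centers, memberships = fuzzy_c_means(features, n_clusters=2)
hard_labels = [row.index(max(row)) for row in memberships]
```

In a fuller pipeline, the feature vectors would presumably be reduced with t-SNE before clustering, and the result scored with indices such as DBI, SC, CHI, and DI, as the abstract describes.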