| Authors | Hamed Vahdat-Nejad, Faezeh Azizi, Hamideh Hajiabadi |
| Journal | International Journal of Data Science and Analytics |
| Pages | 2175-2225 |
| Serial number | 20 |
| Volume number | 3 |
| Article type | Full Paper |
| Publication date | 2025 |
| Journal type | Print |
| Country of publication | Iran |
| Journal indexing | Scopus |
| Keywords | Topic modeling, Natural language processing, Clustering, Social network, COVID-19 |
|---|
Abstract
Topic modeling is widely recognized as one of the most effective and significant methods of unsupervised text analysis. It identifies and extracts topics from document sets associated with various entities (e.g., countries, websites, or journals). Nonetheless, its output lacks high-level information per entity, and applying machine learning methods to topic modeling outputs is generally challenging. Some studies have applied machine learning methods statically, ignoring the effect of time on topic modeling outputs; including time adds further complexity to the problem. This study introduces a novel approach to clustering the output of topic modeling per entity while taking the time factor into account. For this purpose, it proposes topic popularity over time and a feature vector for each entity over time. Because the proposed feature vector is high-dimensional, selecting an appropriate dimensionality reduction technique and a corresponding clustering algorithm is not straightforward, so this research also proposes a new approach to making that selection. A case study on COVID-19-related tweets evaluates the proposed method's performance. The proposed approach applies t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and Fuzzy C-Means (FCM) for clustering. Unlike previous research, our study incorporates the time factor, and it also outperforms previous work in terms of the Davies-Bouldin Index (DBI), Silhouette Coefficient (SC), Calinski-Harabasz Index (CHI), and Dunn Index (DI). The proposed method enables researchers in natural language processing to analyze topic dynamics across various entities, leading to improved research outcomes.
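To make the clustering step named in the abstract more concrete, the following is a minimal plain-Python sketch of Fuzzy C-Means. It is not the authors' implementation: the function name `fuzzy_c_means`, the fuzzifier `m = 2.0`, the stopping tolerance, and the six 2-D points (standing in for entity feature vectors after a t-SNE-style reduction) are all assumptions chosen for illustration.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length points.

    Hypothetical helper for illustration; m is the fuzzifier (m > 1).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / w_sum
                 for d in range(dim)]
            )

        # Membership update: inverse-distance weights with exponent 2 / (m - 1).
        new_u = []
        for p in points:
            dist = [
                max(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c))), 1e-12)
                for c in centers
            ]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            new_u.append([v / total for v in inv])

        # Stop once no membership value moves by more than tol.
        shift = max(
            abs(new_u[i][k] - u[i][k])
            for i in range(n)
            for k in range(n_clusters)
        )
        u = new_u
        if shift < tol:
            break
    return centers, u


# Hypothetical 2-D embedding of six entities (two well-separated groups).
embedded = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.2)]
centers, memberships = fuzzy_c_means(embedded, n_clusters=2)

# Hard labels are the cluster with the highest membership degree.
labels = [row.index(max(row)) for row in memberships]
```

Unlike hard clustering, each entity receives a degree of membership in every cluster, and each row of `memberships` sums to 1. Validity indices such as DBI or SC would then be computed on the hard labels or on the memberships, depending on the index.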