SWlab Pioneers Semantic Document Clustering with GloVe Word Embeddings and DBSCAN

The Semantic Web Lab (SWlab) at the University of Zakho has published new research on semantic document clustering at the 2020 International Conference on Advanced Science and Engineering (ICOASE).

The paper, titled “Glove word embedding and DBSCAN algorithms for semantic document clustering,” is authored by Shapol M Mohammed, Karwan Jacksi, and Subhi RM Zeebaree. It explores how GloVe word embeddings can be combined with the DBSCAN clustering algorithm to improve document clustering.

Because word embeddings capture semantic relationships between words, the research investigates how effective GloVe embeddings are for building semantic representations of documents. Unlike previous studies, which focused primarily on Word2Vec, this work pioneers the use of GloVe embeddings together with the DBSCAN clustering algorithm.

The methodology involved preprocessing Wikipedia and IMDB datasets, both with and without stemming, and then applying the GloVe word embedding algorithm to generate word vectors. These word vectors were then used as input to the DBSCAN clustering algorithm, which grouped documents by semantic similarity.
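To make the pipeline concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes a tiny hand-made table of 3-dimensional vectors (`TOY_VECTORS`, hypothetical values) in place of real pretrained GloVe vectors, represents each document as the average of its word vectors (an assumption; the post does not say how document vectors were built), and uses a small from-scratch DBSCAN with cosine distance. The names `doc_vector`, `cosine_distance`, and `dbscan`, and the values `eps=0.1` and `min_samples=2`, are illustrative choices only.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pretrained GloVe
# embeddings (real GloVe vectors typically have 50 to 300 dimensions).
TOY_VECTORS = {
    "film": [0.9, 0.1, 0.0], "movie": [0.8, 0.2, 0.1], "actor": [0.7, 0.1, 0.2],
    "river": [0.1, 0.9, 0.0], "mountain": [0.0, 0.8, 0.2], "lake": [0.2, 0.9, 0.1],
    "quantum": [0.0, 0.1, 0.9],
}

def doc_vector(text):
    """Average the vectors of the known words in a document (assumed scheme)."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        raise ValueError("document contains no known words")
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point, with -1 meaning noise."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        # A point counts as its own neighbor.
        return [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:  # core point: keep expanding
                queue.extend(j_nbrs)
    return labels

docs = [
    "film actor movie", "movie film", "actor movie",
    "river mountain", "mountain lake river", "lake river",
    "quantum",
]
doc_vectors = [doc_vector(d) for d in docs]
labels = dbscan(doc_vectors, eps=0.1, min_samples=2)
print(labels)
```

With these toy values, the three film-related documents and the three geography-related documents form two separate clusters, and the unrelated last document is marked as noise (-1). In practice one would load real pretrained vectors and use a library implementation such as scikit-learn's `DBSCAN` rather than this hand-rolled version.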

To evaluate the proposed approach, seven metrics were used: Silhouette average, purity, accuracy, F1-score, completeness, homogeneity, and normalized mutual information (NMI) score. The results were compared against those obtained with traditional methods such as TF-IDF and K-means on six different datasets. According to the findings, the proposed combination of GloVe word embeddings and DBSCAN significantly outperforms the existing methods in clustering accuracy and effectiveness.
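For readers unfamiliar with these metrics, the sketch below shows how four of them (purity, homogeneity, completeness, and NMI) can be computed from ground-truth classes and predicted cluster labels. It is an illustration under stated assumptions, not the evaluation code from the paper: the function names and the sample labels are hypothetical, NMI is normalized by the arithmetic mean of the two entropies (one common convention), and DBSCAN noise points, if present, would simply be treated as one more cluster.

```python
import math
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of documents that belong to their cluster's majority class."""
    per_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        per_cluster.setdefault(c, Counter())[t] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (nij / n) * math.log(nij * n / (count_a[x] * count_b[y]))
        for (x, y), nij in joint.items()
    )

def homogeneity(true_labels, cluster_labels):
    """1.0 when every cluster contains members of a single class."""
    h_true = entropy(true_labels)
    return 1.0 if h_true == 0 else mutual_information(true_labels, cluster_labels) / h_true

def completeness(true_labels, cluster_labels):
    """1.0 when all members of a class end up in the same cluster."""
    h_pred = entropy(cluster_labels)
    return 1.0 if h_pred == 0 else mutual_information(true_labels, cluster_labels) / h_pred

def nmi(true_labels, cluster_labels):
    """Mutual information normalized by the mean of the two entropies."""
    denom = (entropy(true_labels) + entropy(cluster_labels)) / 2
    return 1.0 if denom == 0 else mutual_information(true_labels, cluster_labels) / denom

# Hypothetical example: six documents, one film document placed in the wrong cluster.
true_labels = ["film", "film", "film", "geo", "geo", "geo"]
cluster_labels = [0, 0, 1, 1, 1, 1]

print("purity:", purity(true_labels, cluster_labels))
print("homogeneity:", homogeneity(true_labels, cluster_labels))
print("completeness:", completeness(true_labels, cluster_labels))
print("NMI:", nmi(true_labels, cluster_labels))
```

In this example five of the six documents sit in their cluster's majority class, so purity is 5/6. All four scores reach 1.0 only when the clusters match the classes exactly, up to renaming of the cluster labels.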

This research is a notable step forward for document clustering, showing the potential of combining GloVe embeddings with DBSCAN to improve the accuracy and efficiency of information organization and retrieval.
