Title

Document Clustering on Collection of Multidisciplinary Academic Texts: Exploring Document Embeddings and Clustering Techniques

Abstract

A large collection of unstructured documents is usually hard to comprehend at first glance for the people who need to analyse and evaluate it. Before working on such a pile of documents, a useful first step is to group them by similarity. Clustering algorithms are well suited to this kind of batch job. In addition, embedding techniques now make it possible to represent words, sentences, and documents as multi-dimensional vectors, so documents can be compared and clustered using these representations.

In this thesis, we apply embedding techniques to cluster multidisciplinary papers from different contexts. A collection of papers by academics in the Cost Action CA18110 project is clustered with three methods (Agglomerative, K-Means, and DBSCAN) using text representations created with distributed text representation techniques. The methods that performed best in the experiments on this dataset are then run on one small and one large text collection that also contain scientific papers, and the results are observed. Along the way, the hyperparameters of the embedding techniques were tuned over several trials to obtain better clustering results, and the parameters of the clustering methods were tuned to improve the silhouette score. Using a visualisation and experiment tool, many trials were performed; the results were evaluated with different clustering scores and visualised with dimensionality reduction techniques.
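
As a rough illustration of how a silhouette score can be used to choose between candidate clusterings, the following minimal sketch computes the mean silhouette coefficient in pure Python. The toy 2-D vectors, the two label assignments, and the function names are hypothetical stand-ins for real document embeddings and clustering output; they are not taken from the thesis, which may compute the score differently (for example with a library implementation or another distance measure).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient over all points (O(n^2), illustration only)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: coefficient is defined as 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Hypothetical 2-D "document embeddings": two well-separated groups.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Two candidate label assignments, e.g. from two parameter settings.
candidates = {
    "A": [0, 0, 0, 1, 1, 1],  # follows the two groups
    "B": [0, 1, 0, 1, 0, 1],  # mixes the two groups
}

scores = {name: silhouette_score(vectors, labels)
          for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

A higher score means points sit closer to their own cluster than to the nearest other cluster, so under these assumptions the assignment that follows the two groups is preferred.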

Supervisor(s)

MURAT KARA

Date and Location

2022-09-05 16:00:00

Category

MSc_Thesis