Abstract
Due to the abundance of drug candidates, conducting in-lab experiments to find an
effective compound for a given target is a costly and time-consuming task in drug
discovery. This thesis aims to reduce the number of drug candidates during early drug
discovery by clustering the compounds. ChemBERTa, a Bidirectional Encoder
Representations from Transformers (BERT) model, is employed to extract the descriptors
for a compound. The compounds are then clustered on these learned features using
several clustering algorithms, including the k-means algorithm and the Butina
algorithm. Finally, the obtained clusters are evaluated with measures such as the
silhouette score. Our empirical findings show that BERT model embeddings produce
results comparable with those of traditional and graph-based models, as measured by
cluster accuracy and computing runtime.