Title

EXPLOITING WORD AND SENTENCE EMBEDDINGS FOR DIVERSIFICATION IN CRAWLING AND RANKING

Abstract

Abstract

The increase in the volume of the Web and Microblogging sites caused copious amounts of duplicate or near duplicate content which emerged the diversification paradigm. On a typical search system, there are three main components which are a crawler, an indexer and a query processor. While most diversification approaches aim at the query processing stage of the search system, in this work, we aim to apply the diversification paradigm to both crawling and query processing stages. First, we introduce a diversification-aware focused crawler, which considers all the aspects of a given search query in order to construct a collection that contains equal coverage over them. Second, we focus on the diversification of short texts, such as social media posts, for the query processing stage. For both contributions, we apply well-known diversification approaches in the literature and extend the work by exploiting the neural language models that are state-of-the-art for several information retrieval and natural language processing tasks. Our experiments, in which we evaluate both approaches with well-crafted experimental setups, show that the diversification paradigm is successful for both the crawling stage and short texts. Moreover, neural language models perform comparable results for the diversification paradigm.

Supervisor(s)

Supervisor(s)

CAN DURAN UNALDI

Date and Location

Date and Location

2022-09-02 13:00:00

Category

Category

MSc_Thesis