Abstract
In this thesis, we present a novel weakly supervised solution to the problem of early fake news detection on emerging topics. Traditional techniques rely on fact-checkers or on supervised learning with labeled data, which is not readily available for emerging topics. To address this, we introduce WeSTeC, an end-to-end Weakly Supervised Text Classification framework that programmatically labels a large-scale text dataset from a particular domain and trains supervised text classifiers on the assigned labels. The framework combines multiple weak labeling strategies and aggregates the weak labels they generate into a single weak label per data instance. The aggregated labels are then used to fine-tune a pre-trained RoBERTa classifier for fake news detection. Because the weakly labeled dataset contains fake news related to the emerging topic, the trained detection model becomes specialized for the topic at hand. We consider both semi-supervised and domain adaptation setups, which use small amounts of labeled data and labeled data from other domains, respectively. The framework is evaluated on both the quality of the aggregated weak labels and the performance of the fake news detection classifier. In both evaluations, it outperforms all baselines in every setup considered. In addition, the fake news detection model trained on weak labels comes as close as 1\% in accuracy to its fully supervised counterpart, which demonstrates the effectiveness of the framework's weak labeling module.
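To make the aggregation step concrete, the following minimal Python sketch shows one simple way several weak labeling strategies could be combined into a single weak label per instance. It is an illustration under stated assumptions, not the WeSTeC implementation: the keyword rules, the `ABSTAIN` convention, and the majority-vote aggregator are all hypothetical choices, and the abstract does not specify which labeling strategies or aggregation method the framework uses.

```python
from collections import Counter

# Hypothetical label convention (not taken from the thesis).
ABSTAIN = -1  # a labeling function returns this when it has no opinion
REAL, FAKE = 0, 1


# Hypothetical keyword-based labeling functions, for illustration only.
def lf_sensational(text):
    return FAKE if "shocking" in text.lower() else ABSTAIN


def lf_cites_source(text):
    return REAL if "according to" in text.lower() else ABSTAIN


def lf_miracle_cure(text):
    return FAKE if "miracle cure" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_sensational, lf_cites_source, lf_miracle_cure]


def aggregate(text, lfs=LABELING_FUNCTIONS):
    """Majority vote over the non-abstaining labeling functions.

    Returns ABSTAIN when no function fires or when the top vote is tied.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the two most frequent labels
    return ranked[0][0]


def weakly_label(texts):
    """Keep only the instances that received a non-abstain weak label."""
    labeled = [(t, aggregate(t)) for t in texts]
    return [(t, y) for t, y in labeled if y != ABSTAIN]
```

In a pipeline of this kind, the pairs returned by `weakly_label` would serve as the training set for the downstream classifier. Majority voting is used here only because it is the simplest aggregator to read; it treats every labeling function as equally reliable, which a learned aggregation model would not.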