Abstract
This thesis introduces REIGN (Refurbished Embeddings with Integrated Guidance Networks), a novel framework for efficient representation learning on long-form textual documents. Unlike traditional Transformer-based approaches, which are constrained by maximum input length, REIGN adopts a hierarchical strategy in which pre-trained Guidance Networks (GNs) produce fixed-size embeddings for each document chunk. These chunk-level embeddings are then processed by a lightweight encoder trained with a contrastive objective inspired by SimCLR. This decoupled design enables semantic understanding of documents containing hundreds of thousands of tokens without relying on subword tokenization or end-to-end backpropagation through large models. REIGN is benchmarked on synthetic datasets of long-context documents and demonstrates strong performance in document-level semantic retrieval while remaining computationally efficient and scalable. In addition, a caching mechanism precomputes and reuses GN embeddings, offloading the heaviest computation to the pre-encoding stage; this substantially accelerates training and fine-tuning, reduces memory overhead, and makes REIGN particularly well-suited to resource-constrained or iterative experimentation settings.
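To make the decoupled design concrete, the sketch below illustrates the pipeline the abstract describes: chunk embeddings are precomputed once by a frozen Guidance Network and cached, and only a small encoder on top of them is trained with a SimCLR-style (NT-Xent) contrastive loss. This is a minimal, hypothetical PyTorch sketch; the class names (ChunkCache, LightweightDocEncoder), dimensions, and loss details are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch (assumptions, not the thesis code) of REIGN's decoupled pipeline:
# (1) chunk embeddings are produced once by a frozen Guidance Network and cached,
# (2) a lightweight encoder is trained on top with a SimCLR-style (NT-Xent) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkCache:
    """Hypothetical caching layer: stores fixed-size chunk embeddings so the
    heavy GN forward pass never runs inside the training loop."""

    def __init__(self, guidance_network):
        self.gn = guidance_network   # frozen, pre-trained GN (placeholder)
        self.store = {}              # doc_id -> (num_chunks, gn_dim) tensor

    @torch.no_grad()
    def get(self, doc_id, chunks):
        if doc_id not in self.store:
            self.store[doc_id] = self.gn(chunks)   # one-off heavy computation
        return self.store[doc_id]


class LightweightDocEncoder(nn.Module):
    """Small encoder over cached chunk embeddings; mean-pools chunks into a
    single document vector (architecture details are assumptions)."""

    def __init__(self, gn_dim=768, doc_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(gn_dim, doc_dim), nn.ReLU(),
                                  nn.Linear(doc_dim, doc_dim))

    def forward(self, chunk_embs):                  # (batch, num_chunks, gn_dim)
        return self.proj(chunk_embs).mean(dim=1)    # (batch, doc_dim)


def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss between two views of the same documents."""
    batch = z1.size(0)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                  # (2B, d)
    sim = z @ z.t() / temperature                   # pairwise similarities
    mask = torch.eye(2 * batch, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))      # exclude self-similarity
    targets = torch.cat([torch.arange(batch, 2 * batch),
                         torch.arange(batch)])      # index of each positive pair
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    # Toy run with random tensors standing in for cached GN chunk embeddings.
    encoder = LightweightDocEncoder()
    view_a = torch.randn(4, 32, 768)                        # 4 docs, 32 chunks each
    view_b = view_a + 0.05 * torch.randn_like(view_a)       # augmented second view
    loss = nt_xent_loss(encoder(view_a), encoder(view_b))
    loss.backward()    # gradients flow only through the lightweight encoder
    print(f"contrastive loss: {loss.item():.4f}")
```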