Text generation from visual data is a problem with a wide range of applications that is often studied using deep learning. This thesis focuses on two different aspects of this problem, proposing both supervised and unsupervised methods to solve it.
In the first part of the thesis, we work on referring expression comprehension and generation from videos. We specifically work with relational referring expressions, which we define as expressions that describe an object in relation to another object. To this end, we first collect a novel dataset of referring expressions and videos in which multiple copies of the same object appear, making relational referring expressions necessary to describe them. We then train two baseline deep networks on this dataset, which achieve promising results. Finally, we propose a deep attention network that significantly outperforms the baselines on our dataset.
In the second part of the thesis, we tackle the problem addressed in the first part in an unsupervised setting. Models that generate text from videos or images tend to be supervised, meaning that a corresponding textual description is required for every visual example in the datasets they use. However, collecting such paired data is costly, and much of the available data is unlabeled. Since the lack of paired data was one of the bottlenecks in the supervised part of this thesis, we consider the same problem without supervision. To this end, we adapt the CycleGAN architecture of Zhu \etal to operate between the visual and textual domains. We then use this architecture to perform experiments on several video and image captioning datasets, on some of which we achieve promising results.
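To make the adaptation concrete, the core objective being transferred across domains can be sketched as follows; the notation here follows the original CycleGAN formulation of Zhu \etal rather than this thesis, with $X$ standing in for the visual domain and $Y$ for the text domain. Given a mapping $G : X \to Y$ and an inverse mapping $F : Y \to X$, the cycle-consistency loss is
\begin{equation*}
\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\lVert G(F(y)) - y \rVert_1\right],
\end{equation*}
which requires that translating a sample to the other domain and back recovers the original, and is what allows training without paired examples.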