Thanks to advances in deep learning, striking results have been obtained in translation between different image modalities or spaces; e.g., using Generative Adversarial Networks, one can create a highly realistic colorized version of a black-and-white image, or a daylight version of a nighttime image. However, existing studies generally tackle the problem in pairs and therefore ignore the common information that is shared across different image modalities. In this thesis, a method that constructs an embedding space shared by all image modalities is proposed. The embedding space is learned using pairs of modalities. Such a space allows extracting a scene representation that is shared by all image modalities. Once learned, the space enables zero-shot translation between two modalities for which no paired data is available. Moreover, a new modality can easily be integrated into the model, making the approach scalable.