Article

Invertible Autoencoder for Domain Adaptation

Yunfei Teng * and Anna Choromanska *
Department of Electrical and Computer Engineering, NYU Tandon School of Engineering, 5 MetroTech Center, Brooklyn, NY 11201, USA
* Authors to whom correspondence should be addressed.
Submission received: 17 December 2018 / Revised: 8 March 2019 / Accepted: 21 March 2019 / Published: 27 March 2019
(This article belongs to the Special Issue Machine Learning for Computational Science and Engineering)

Abstract

Unsupervised image-to-image translation aims at finding a mapping between a source image domain $\mathcal{A}$ and a target image domain $\mathcal{B}$ when, as in many applications, aligned image pairs are not available at training. This is an ill-posed learning problem, since it requires inferring a joint probability distribution from its marginals. State-of-the-art methods such as CycleGAN jointly learn the coupled mappings $F_{AB}: \mathcal{A} \to \mathcal{B}$ and $F_{BA}: \mathcal{B} \to \mathcal{A}$ and introduce a cycle consistency requirement into the learning problem, i.e., $F_{AB}(F_{BA}(\mathcal{B})) \approx \mathcal{B}$ and $F_{BA}(F_{AB}(\mathcal{A})) \approx \mathcal{A}$. Cycle consistency enforces the preservation of the mutual information between input and translated images; however, it does not explicitly enforce $F_{BA}$ to be the inverse operation of $F_{AB}$. We propose a new deep architecture that we call invertible autoencoder (InvAuto) to explicitly enforce this relation. This is done by forcing the encoder to be an inverted version of the decoder, where corresponding layers perform opposite mappings and share parameters. The mappings are constrained to be orthonormal. The resulting architecture reduces the number of trainable parameters (by up to a factor of 2). We present image translation results on benchmark datasets and demonstrate state-of-the-art performance of our approach. Finally, we test the proposed domain adaptation method on the task of road video conversion. We demonstrate that the videos converted with InvAuto are of high quality and show that PilotNet, the NVIDIA neural-network-based end-to-end learning system for autonomous driving, trained on real road videos performs well when tested on the converted ones.

1. Introduction

The inter-domain translation problem of converting an instance, e.g., an image or a video, from one domain to another is applicable to a wide variety of learning tasks, including object detection and recognition, image categorization, sentiment analysis, action recognition, speech recognition, and more. High-quality domain translators ensure that an arbitrary learning model trained on samples from the source domain can perform well when tested on the translated samples. (Similarly, an arbitrary learning model trained on the translated samples should perform well on samples from the target domain. Training in this framework is, however, much more computationally expensive.) The translation problem can be posed in the supervised learning framework, e.g., [1,2], where the learner has access to corresponding pairs of instances from both domains, or in the unsupervised learning framework, e.g., [3,4], where no such paired instances are available. This paper focuses on the latter case, which is more difficult but at the same time more realistic, as acquiring a dataset of paired images is often impossible in practice.
Unsupervised domain adaptation is typically solved using the generative adversarial network (GAN) framework [5]. GANs constitute a family of methods that learn generative models from complicated real-world data. In order to teach the generator to synthesize semantically meaningful data from standard signal distributions, GANs train a discriminator to distinguish real samples in the training dataset from fake samples synthesized by the generator. The generator aims to deceive the discriminator by producing increasingly realistic samples. Thus, the generator and discriminator play an adversarial game, during which the generator learns to produce samples from the desired data distribution and the discriminator eventually cannot make a better decision than randomly guessing whether a particular sample is fake or real. GANs have recently been successfully applied to image generation [6,7,8,9], image editing [1,3,10,11], video prediction [12,13,14], and many other tasks [15,16,17]. In the domain adaptation setting, the generator performs the domain translation and is trained to learn the mapping from the source to the target domain, while the discriminator is trained to discriminate between original images from the target domain and those provided by the generator. In this setting, the generator usually has the structure of an autoencoder. The two most common state-of-the-art domain adaptation approaches, CycleGAN [3] and UNIT [4], are built on this basic approach. CycleGAN addresses the problem of adaptation from domain $\mathcal{A}$ to domain $\mathcal{B}$ by training two translation networks, where one realizes the mapping $F_{AB}$ and the other realizes $F_{BA}$. The cycle consistency loss ensures the correlation between the input image and the corresponding translation. In particular, to achieve cycle consistency, CycleGAN trains two autoencoders, where each minimizes its own adversarial loss and they both jointly minimize
$$\big\| F_{AB}(F_{BA}(\mathcal{B})) - \mathcal{B} \big\|_2^2 \quad \text{and} \quad \big\| F_{BA}(F_{AB}(\mathcal{A})) - \mathcal{A} \big\|_2^2. \tag{1}$$
The cycle consistency loss is also incorporated into recent implementations of UNIT. It is implicitly assumed that the model will learn the mappings $F_{AB}$ and $F_{BA}$ in such a way that $F_{AB} = F_{BA}^{-1}$; however, this is not explicitly imposed. Consider a simple example. Assume the first autoencoder is a two-layer linear multi-layer perceptron (MLP), where the weight matrix of the first layer (encoder) is denoted as $E_1$ and the weight matrix of the second layer (decoder) is denoted as $D_1$. Thus, for an input $x_A \in \mathcal{A}$ it outputs $y_B(x_A) = D_1 E_1 x_A$. The second autoencoder is then a two-layer MLP with encoder weight matrix $E_2$ and decoder weight matrix $D_2$ that for an input data point $x_B \in \mathcal{B}$ should produce the output $y_A(x_B) = D_2 E_2 x_B$. To satisfy the cycle consistency requirement, the following should hold: $y_A(y_B(x_A)) = x_A$ and $y_B(y_A(x_B)) = x_B$. These two conditions are equivalent to $D_2 E_2 D_1 E_1 = I$ and $D_1 E_1 D_2 E_2 = I$. This holds, for example, when $D_1 = E_2^{-1}$ and $D_2 = E_1^{-1}$.
In contrast to this approach, we explicitly require $F_{AB} = F_{BA}^{-1}$. Thus, in the context of the given simple example, we correlate the encoders and decoders to satisfy the inversion conditions $D_1 = E_2^{-1}$ and $D_2 = E_1^{-1}$. We avoid performing prohibitive inversions of large matrices and instead guarantee these conditions to hold through two steps: (i) introducing a shared parametrization of encoder $E_2$ and decoder $D_1$ such that $D_1 = E_2^\top$ (the pair $E_1$ and $D_2$ is treated similarly), and (ii) appropriate training to achieve orthonormality, $E_2^\top = E_2^{-1}$ and $E_1^\top = E_1^{-1}$, i.e., we train the autoencoder $(E_2, D_1)$ to satisfy $D_1 E_2 x_B = x_B$ for an arbitrary input $x_B$ and the autoencoder $(E_1, D_2)$ to satisfy $D_2 E_1 x_A = x_A$ for an arbitrary input $x_A$. Since the encoder and decoder are coupled as given in (i), such training leads to satisfying the inversion conditions. Practical networks contain linear and non-linear transformations; we therefore propose specific architectures that are invertible.
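To make the shared parametrization concrete, the following sketch (our illustrative PyTorch code, not part of the paper) implements the two-layer linear example above with tied weights $D_1 = E_2^\top$ and $D_2 = E_1^\top$ and trains both autoencoders toward orthonormality.

```python
import torch

# Illustrative sketch (assumed PyTorch implementation, not the authors' code):
# two linear translators with tied weights D1 = E2^T and D2 = E1^T, as in the
# example above. Training the autoencoders (E2, D1) and (E1, D2) to reconstruct
# their inputs pushes E1 and E2 toward orthonormality (E^T = E^-1), which in
# turn makes the translators F_AB = D1 E1 and F_BA = D2 E2 inverses.
d = 8
E1 = torch.nn.Parameter(0.1 * torch.randn(d, d))
E2 = torch.nn.Parameter(0.1 * torch.randn(d, d))

def F_AB(x_A):                    # first autoencoder: encoder E1, decoder D1 = E2^T
    return E2.t() @ (E1 @ x_A)

def F_BA(x_B):                    # second autoencoder: encoder E2, decoder D2 = E1^T
    return E1.t() @ (E2 @ x_B)

opt = torch.optim.Adam([E1, E2], lr=1e-2)
for _ in range(2000):
    x_A, x_B = torch.randn(d), torch.randn(d)
    # reconstruction losses: D2 E1 x_A ~ x_A and D1 E2 x_B ~ x_B
    loss = (E1.t() @ (E1 @ x_A) - x_A).pow(2).sum() + \
           (E2.t() @ (E2 @ x_B) - x_B).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

x_A = torch.randn(d)
# cycle consistency emerges as E1, E2 become orthonormal (error approaches 0)
print((F_BA(F_AB(x_A)) - x_A).abs().max().item())
```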
Figure 1 (see also its extended version, Figure A4, in Appendix A) and Figure 2 illustrate the basic idea behind InvAuto. The plots were obtained by training a single autoencoder $(E, D)$ to reconstruct its input. InvAuto has shared weights satisfying $D = E^\top$ and inverted non-linearities, and clearly obtains a matrix $DE$ that is the closest to the identity compared to the other methods, i.e., the vanilla autoencoder (Auto), the autoencoder with cycle consistency (Cycle), and the variational autoencoder (VAE) [18]. Note also that, at the same time, InvAuto requires half the number of trainable parameters, because the encoder and decoder use the same parameters.
This paper is organized as follows: Section 2 reviews the literature, Section 3 explains InvAuto in detail, Section 4 explains how to apply InvAuto to domain adaptation, Section 5 demonstrates experimental verification of the proposed approach, and Section 6 provides conclusions.

2. Related Work

Unsupervised image-to-image translation models were developed to tackle the domain adaptation problem with unpaired datasets. A plethora of existing approaches utilize autoencoders trained in the GAN framework, where the autoencoder serves as the generator. These include approaches based on conditional GANs [2,19] and methods that introduce additional components to the loss function to force partial cycle consistency [20]. Another approach [21] introduces two coupled GANs, where each generator is an autoencoder and the coupling is obtained by sharing a subset of weights between the autoencoders as well as between the discriminators. This technique was later extended to utilize variational autoencoders as generators [4]; the resulting approach is commonly known as UNIT. CycleGAN presents yet another way of addressing image-to-image translation through a specific training scheme that preserves the mutual information between input and translated images [22]. UNIT and CycleGAN constitute the most popular choices for performing image-to-image translation.
There also exist other learning tasks that can be viewed as instances of the image-to-image translation problem. Among them, notable approaches focus on style transfer [23,24,25,26]. They aim at preserving the content of the input image while altering its style to mimic the style of the images from the target domain. This goal is achieved by introducing content and style loss functions that are jointly optimized. Finally, inverse problems, such as super-resolution, also fall into the category of image-to-image translation problems [27].

3. Invertible Autoencoder

Here we explain the details of the InvAuto architecture. The architecture needs to be symmetric to allow invertibility, e.g., the layers should be arranged as an encoder $E = (T_1, T_2, \dots, T_M)$ followed by a decoder $D = (T_M^{-1}, T_{M-1}^{-1}, \dots, T_1^{-1})$, where $T_1, T_2, \dots, T_M$ denote the subsequent transformations of the signal being propagated through the network ($M$ is the total number of those) and $T_1^{-1}, T_2^{-1}, \dots, T_M^{-1}$ denote their inversions. Thus, the architecture is inverted layer by layer, where every layer of the encoder has a mirror-inverted counterpart in the decoder. The autoencoder is trained to reconstruct its input. Below we explain how to invert different types of layers of the deep model.
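The layer-by-layer mirroring can be organized, for instance, as in the following skeleton (an illustrative PyTorch sketch under our own naming conventions, not the authors' code): the decoder walks the encoder's modules in reverse order and applies each layer's inverted counterpart, so the two halves automatically share all parameters.

```python
import torch

# Illustrative skeleton (assumed API, not the authors' code): each module is
# expected to expose forward(x) for the encoder direction and inverse(y) for
# its inverted decoder counterpart (transposed weights, inverted non-linearity).
class InvertibleAutoencoder(torch.nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)   # encoder E = (T_1, ..., T_M)

    def encode(self, x):
        for layer in self.layers:                   # apply T_1, ..., T_M
            x = layer(x)
        return x

    def decode(self, z):
        for layer in reversed(self.layers):         # decoder D = (T_M^{-1}, ..., T_1^{-1})
            z = layer.inverse(z)
        return z

    def forward(self, x):
        # trained to reconstruct its input; this drives each layer's inverse()
        # to actually invert forward(), i.e., it enforces orthonormal mappings
        return self.decode(self.encode(x))
```

The concrete layer types listed in the following subsections (fully connected, convolutional, activation, residual block, bias) can all be written to fit this forward/inverse interface.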

3.1. Fully Connected Layer

Consider the transformation $T^E$ of an input signal performed by an arbitrary fully connected layer of an encoder $E$ parametrized with a weight matrix $W$. Let $x$ denote the layer's input and $y$ denote its output. Thus,
$$T^E:\; y = W x. \tag{2}$$
An inverse operation is then defined as
$$(T^E)^{-1}:\; x = W^{-1} y. \tag{3}$$
We parametrize the counterpart layer of the decoder with the transpose of $W$, so that the considered encoder and decoder layers share their parametrization. Therefore, we enforce the counterpart decoder's layer to perform the transformation
$$T^D:\; x = W^\top y. \tag{4}$$
By training the autoencoder to reconstruct its input at its output, we enforce orthonormality, $W^{-1} = W^\top$, and thus the equivalence of the transformations $(T^E)^{-1}$ and $T^D$, i.e., $(T^E)^{-1} \equiv T^D$.
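As an illustration, a tied fully connected layer could be implemented as in the sketch below (our own PyTorch-style code, not from the paper); the decoder side reuses the encoder's weight matrix and applies its transpose.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (assumed implementation): a fully connected layer whose
# decoder counterpart shares the same weight matrix W and applies W^T.
# Orthonormality (W^T = W^-1) is not hard-coded; it emerges from training the
# autoencoder to reconstruct its input.
class TiedLinear(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = torch.nn.Parameter(torch.empty(dim, dim))
        torch.nn.init.orthogonal_(self.W)    # reasonable starting point

    def forward(self, x):                    # encoder side, T^E: y = W x
        return F.linear(x, self.W)           # applies W to each sample in the batch

    def inverse(self, y):                    # decoder side, T^D: x = W^T y
        return F.linear(y, self.W.t())
```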

3.2. Convolutional Layer

Consider the transformation $T^E$ of an input image performed by an arbitrary convolutional layer of an encoder $E$. Let $x$ denote this layer's vectorized input image and $y$ denote the corresponding output. A 2D convolution can be implemented as a matrix multiplication involving a Toeplitz matrix [28], which is obtained from the set of kernels of the 2D convolutional filters. Thus, the transformation $T^E$ and its inverse $(T^E)^{-1}$ can be described by the same equations as before, Equations (2) and (3), where $W$ is now a Toeplitz matrix. We again parametrize the counterpart layer of the decoder with the transpose of the Toeplitz matrix $W$. In practice, the transpose of the Toeplitz matrix is obtained by copying the weights from the considered convolutional layer to the counterpart decoder's layer, which is implemented as a transposed convolutional layer (also known as a deconvolutional layer). Therefore, as before, we enforce the counterpart decoder's layer to perform the transformation $T^D: x = W^\top y$ and by appropriate training ensure $(T^E)^{-1} \equiv T^D$.
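A convolutional layer can be tied to its decoder counterpart in the same way; in the sketch below (our illustrative PyTorch code, not the authors' implementation) the transposed convolution reuses the encoder's kernel tensor, which applies the transpose of the corresponding Toeplitz matrix.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (assumed implementation): a convolutional layer and its
# decoder counterpart share one kernel tensor. With matching stride and padding,
# conv_transpose2d with the same kernel applies the transpose W^T of the
# convolution's Toeplitz matrix W.
class TiedConv2d(torch.nn.Module):
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = torch.nn.Parameter(0.05 * torch.randn(out_ch, in_ch, k, k))
        self.padding = padding

    def forward(self, x):          # encoder side, T^E: y = W x
        return F.conv2d(x, self.weight, padding=self.padding)

    def inverse(self, y):          # decoder side, T^D: x = W^T y
        return F.conv_transpose2d(y, self.weight, padding=self.padding)
```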

3.3. Activation Function

An invertible activation function should be a bijection. In this paper, we consider a modified LeakyReLU activation function $\sigma$ and use only this non-linearity in the model. Consider the transformation $T^E$ of an input signal performed by this non-linearity applied in the encoder $E$. The non-linearity is defined as
$$T^E:\; y = \sigma(x) = \begin{cases} \frac{1}{\alpha}\, x, & \text{if } x \geq 0, \\ \alpha\, x, & \text{otherwise.} \end{cases} \tag{5}$$
An inverse operation is then defined as
$$(T^E)^{-1}:\; x = \sigma^{-1}(y) = \begin{cases} \alpha\, y, & \text{if } y \geq 0, \\ \frac{1}{\alpha}\, y, & \text{otherwise.} \end{cases} \tag{6}$$
The corresponding non-linearity in the decoder therefore realizes the operation of the inverted modified LeakyReLU given in Equation (6). In the experiments, we set $\alpha = 2$.
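For completeness, the modified LeakyReLU and its inverse could look as follows (an illustrative sketch with $\alpha = 2$, as in our experiments):

```python
import torch

# Illustrative sketch of the modified LeakyReLU (Equation (5)) and its inverse
# (Equation (6)); since sigma preserves the sign of its input, the inverse can
# branch on the sign of y.
ALPHA = 2.0

def sigma(x, alpha=ALPHA):
    return torch.where(x >= 0, x / alpha, alpha * x)

def sigma_inv(y, alpha=ALPHA):
    return torch.where(y >= 0, alpha * y, y / alpha)

x = torch.randn(4)
print(torch.allclose(sigma_inv(sigma(x)), x))   # exact inversion, prints True
```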

3.4. Residual Block

Consider the transformation $T^E$ of an input signal performed by a residual block [29] of an encoder $E$. We modify the residual block to remove the internal non-linearity, as shown in Figure 3a. The residual block is parametrized with weight matrices $W_1$ and $W_2$; these are the Toeplitz matrices corresponding to the convolutional and transposed convolutional layers of the residual block. Let $x$ denote the block's vectorized input and $y$ denote its corresponding output. Thus, the transformation $T^E$ is defined as
$$T^E:\; y = \sigma\big((W_2 W_1 + I)\, x\big). \tag{7}$$
An inverse operation is then defined as
$$(T^E)^{-1}:\; x = (W_2 W_1 + I)^{-1}\, \sigma^{-1}(y). \tag{8}$$
We parametrize the counterpart residual block of the decoder with the transpose of the matrix $W_2 W_1 + I$, as shown in Figure 3b. Therefore, we enforce the counterpart decoder's residual block to perform the transformation
$$T^D:\; x = (W_1^\top W_2^\top + I)\, \sigma^{-1}(y). \tag{9}$$
As before, training enforces orthonormality, $(W_2 W_1 + I)^{-1} = (W_2 W_1 + I)^\top$, and thus $(T^E)^{-1} \equiv T^D$.
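Putting the pieces together, an invertible residual block could be sketched as follows (our illustrative PyTorch code under the assumptions above, not the authors' implementation); the decoder block first undoes the non-linearity and then applies the transposed kernels in reverse order, realizing $(W_2 W_1 + I)^\top = W_1^\top W_2^\top + I$.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (assumed implementation) of the modified residual block
# of Figure 3: two convolutions without an internal non-linearity, a skip
# connection, and the modified LeakyReLU at the output. The decoder counterpart
# shares both kernels and applies their transposes in reverse order.
class InvResBlock(torch.nn.Module):
    def __init__(self, ch, k=3, alpha=2.0):
        super().__init__()
        self.w1 = torch.nn.Parameter(0.05 * torch.randn(ch, ch, k, k))
        self.w2 = torch.nn.Parameter(0.05 * torch.randn(ch, ch, k, k))
        self.alpha = alpha

    def forward(self, x):            # T^E: y = sigma((W2 W1 + I) x)
        h = F.conv2d(F.conv2d(x, self.w1, padding=1), self.w2, padding=1) + x
        return torch.where(h >= 0, h / self.alpha, self.alpha * h)

    def inverse(self, y):            # T^D: x = (W1^T W2^T + I) sigma^{-1}(y)
        h = torch.where(y >= 0, self.alpha * y, y / self.alpha)
        return F.conv_transpose2d(
            F.conv_transpose2d(h, self.w2, padding=1), self.w1, padding=1) + h
```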

3.5. Bias

We consider bias as a separate layer in the network. Then, handling biases is straightforward. In particular, the layer in the encoder that performs bias addition has its counterpart layer in the decoder, where the same bias is subtracted.
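A tied bias layer is then trivial (illustrative sketch, not the authors' code):

```python
import torch

# Illustrative sketch: the bias is a stand-alone layer; its decoder counterpart
# subtracts the same (shared) bias vector, which inverts it exactly.
class TiedBias(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.b = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x):      # encoder side: add the bias
        return x + self.b

    def inverse(self, y):      # decoder side: subtract the same bias
        return y - self.b
```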

3.6. Experimental Validation of Orthonormality

In this section, we validate the concept of InvAuto. The goal of this section is to show that the proposed shared parametrization and training enforce orthonormality, and that at the same time the orthonormality property is not organically achieved by standard architectures. We compare InvAuto with the previously mentioned vanilla autoencoder, autoencoder with cycle consistency, and variational autoencoder. We experimented with various datasets (MNIST and CIFAR-10) and architectures (MLP, convolutional (Conv), and ResNet). All networks were designed to have two down-sampling layers and two up-sampling layers. The encoder's matrix $E$ and the decoder's matrix $D$ are constructed by multiplying the weight matrices of the consecutive layers of the encoder and decoder, respectively.
We test orthonormality by reporting the histograms of the cosine similarity of each pair of rows of the matrix $E$ for all methods (Figure 4), along with their mean and standard deviation (Table 1); we expect the cosine similarity to be close to 0 for InvAuto. We then show the $\ell_2$-norm of the rows of $E$, as we expect the rows of InvAuto to have close-to-unit norm (Table 2). InvAuto enforces the encoder, and consequently the decoder, to be orthonormal. The other methods do not explicitly demand this, and thus the orthonormality of their encoders is weaker. This observation is further confirmed by Figure 1 and Figure 2 shown in the Introduction. In Appendix A, we provide three more figures that complement Figure 2 (recall that the latter reports the MSE of $DE - I$). They show the MSE of the diagonal (Figure A1) and of the off-diagonal (Figure A2) of $DE - I$, as well as the ratio of the MSE of the off-diagonal and diagonal of $DE$ (Figure A3), for the various methods. The reconstruction loss obtained for all methods is also shown in Appendix A (Table A1).
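The statistics in Table 1 and Table 2 can be computed, for instance, as in the following sketch (an assumed helper of ours, not the authors' evaluation code), given the effective encoder matrix $E$ obtained by multiplying the layer weight matrices:

```python
import torch

# Illustrative helper (assumption, not the authors' code): pairwise cosine
# similarity of the rows of E (off-diagonal entries only) and the l2-norms of
# the rows, i.e., the quantities summarized in Tables 1 and 2.
def orthonormality_stats(E):
    norms = E.norm(dim=1)                                  # row l2-norms (Table 2)
    E_hat = E / norms.unsqueeze(1)
    cos = E_hat @ E_hat.t()
    mask = ~torch.eye(E.shape[0], dtype=torch.bool)
    off_diag = cos[mask]                                   # cosine similarities (Table 1)
    return off_diag.mean(), off_diag.std(), norms.mean(), norms.std()

# For an orthonormal E, the cosine similarities concentrate at 0 and the row
# norms at 1.
Q, _ = torch.linalg.qr(torch.randn(64, 64))
print(orthonormality_stats(Q))
```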
Next we describe how InvAuto is applied to the problem of domain adaptation.

4. Invertible Autoencoder for Domain Adaptation

For the purpose of performing domain adaptation, we construct a dedicated architecture that is similar to CycleGAN, but we use InvAuto at the feature level of the generators. This InvAuto contains an encoder $E$ and a decoder $D$ that themselves have the form of autoencoders. Each of these internal autoencoders is used to perform the conversion between the features corresponding to the two different domains: the encoder $E$ converts the features corresponding to domain $\mathcal{A}$ into the features corresponding to domain $\mathcal{B}$, and the decoder $D$ converts the features corresponding to domain $\mathcal{B}$ into the features corresponding to domain $\mathcal{A}$. Since $E$ and $D$ form an InvAuto, $E$ realizes an inversion of $D$ (and vice versa) and shares parameters with $D$. This introduces strong correlations between the two generators and reduces the number of trainable parameters, which distinguishes our approach from CycleGAN. The proposed architecture is illustrated in Figure 5. The details of the architecture and training are provided in Appendix A.
Next we describe the cost function that we use to train our deep model. The first component of the cost function is the adversarial loss [5], i.e.,
$$L_{\mathrm{adv}}(\mathrm{Gen}_A, \mathrm{Dis}_A) = \mathbb{E}_{x_A \sim p_d(\mathcal{A})}\big[\log \mathrm{Dis}_A(x_A)\big] + \mathbb{E}_{x_B \sim p_d(\mathcal{B})}\big[\log\big(1 - \mathrm{Dis}_A(\mathrm{Gen}_A(x_B))\big)\big], \tag{10}$$
$$L_{\mathrm{adv}}(\mathrm{Gen}_B, \mathrm{Dis}_B) = \mathbb{E}_{x_B \sim p_d(\mathcal{B})}\big[\log \mathrm{Dis}_B(x_B)\big] + \mathbb{E}_{x_A \sim p_d(\mathcal{A})}\big[\log\big(1 - \mathrm{Dis}_B(\mathrm{Gen}_B(x_A))\big)\big], \tag{11}$$
where $p_d(\mathcal{A})$ and $p_d(\mathcal{B})$ denote the distributions of the data from $\mathcal{A}$ and $\mathcal{B}$, respectively.
The second component of the loss function is the cycle consistency loss defined as
$$L_{cc}(\mathrm{Gen}_A, \mathrm{Gen}_B) = \mathbb{E}_{x_A \sim p_d(\mathcal{A})}\big[\| x_A - \mathrm{Gen}_A(\mathrm{Gen}_B(x_A)) \|_1\big] + \mathbb{E}_{x_B \sim p_d(\mathcal{B})}\big[\| x_B - \mathrm{Gen}_B(\mathrm{Gen}_A(x_B)) \|_1\big]. \tag{12}$$
The objective function that we minimize therefore becomes
$$L(\mathrm{Gen}_A, \mathrm{Gen}_B, \mathrm{Dis}_A, \mathrm{Dis}_B) = \lambda L_{cc}(\mathrm{Gen}_A, \mathrm{Gen}_B) + L_{\mathrm{adv}}(\mathrm{Gen}_A, \mathrm{Dis}_A) + L_{\mathrm{adv}}(\mathrm{Gen}_B, \mathrm{Dis}_B), \tag{13}$$
where λ controls the balance between the adversarial loss and cycle consistency loss. The cycle consistency loss enforces the orthonormality property of InvAuto.
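For reference, the terms of Equation (13) can be assembled as in the sketch below (our illustrative code with assumed callables gen_A, gen_B, dis_A, dis_B; discriminators are assumed to output probabilities). In practice, the generators and discriminators are updated with separate objectives, as in standard GAN training; the sketch only shows how the individual terms are formed.

```python
import torch

# Illustrative sketch (assumed interfaces, not the authors' training code):
# Gen_A maps B -> A, Gen_B maps A -> B; Dis_A and Dis_B output probabilities.
bce = torch.nn.BCELoss()

def objective(x_A, x_B, gen_A, gen_B, dis_A, dis_B, lam=10.0):
    y_B, y_A = gen_B(x_A), gen_A(x_B)                     # translations
    # adversarial terms, Equations (10) and (11), in their common BCE form
    adv_A = bce(dis_A(x_A), torch.ones_like(dis_A(x_A))) + \
            bce(dis_A(y_A), torch.zeros_like(dis_A(y_A)))
    adv_B = bce(dis_B(x_B), torch.ones_like(dis_B(x_B))) + \
            bce(dis_B(y_B), torch.zeros_like(dis_B(y_B)))
    # cycle consistency (l1), Equation (12): translate back and compare
    cyc = (gen_A(y_B) - x_A).abs().mean() + (gen_B(y_A) - x_B).abs().mean()
    return lam * cyc + adv_A + adv_B                       # Equation (13)
```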

5. Experiments

We next demonstrate experiments on domain adaptation problems. We compare our model against UNIT [4] and CycleGAN [3], using the publicly available implementations of both methods from https://github.com/mingyuliutw/UNIT/ and https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/. The details of our architecture and the training process are summarized in Appendix A.

5.1. Experiments with Benchmark Datasets

We considered the following domain adaptation tasks:
(i)
Day-to-night and night-to-day image conversion: we used unpaired road pictures recorded during the day and at night obtained from KAIST dataset [30].
(ii)
Day-to-thermal and thermal-to-day image conversion: we used road pictures recorded during the day with a regular camera and a thermal camera obtained from KAIST dataset [30].
(iii)
Maps-to-satellite and satellite-to-maps: we used satellite images and maps obtained from Google Maps [1].
The datasets for the last two tasks, i.e., (ii) and (iii), are originally paired; however, we randomly permuted them and trained the model in an unsupervised fashion. The training and testing images were furthermore resized to 128 × 128 resolution.
The visual results of image conversion are presented in Figure 6, Figure 7 and Figure 8 (Appendix A contains the same figures in higher resolution). We see that InvAuto visually performs comparably to other state-of-the-art methods.
To evaluate the performance of the methods numerically we use the following approach:
  • For tasks (ii) and (iii), we directly calculated the $\ell_1$ loss between the converted images and the ground truth.
  • For task (i), we trained two autoencoders $\Omega_A$ and $\Omega_B$, one for each domain, i.e., we trained each of them to perform high-quality reconstruction of the images from its own domain and low-quality reconstruction of the images from the other domain. We then use these two autoencoders to evaluate the quality of the converted images, where a high $\ell_1$ reconstruction loss of the autoencoder on the images converted to resemble those from its corresponding domain implies low-quality image translation.
Table 3 contains the results of the numerical evaluation and shows that the performance of InvAuto is similar to that of the state-of-the-art techniques we compare against; furthermore, it lies within the performance range established by CycleGAN (the best performer) and UNIT (consistently slightly worse than CycleGAN).

5.2. Experiments with Autonomous Driving System

To test the quality of the image-to-image translations obtained by InvAuto, we use the NVIDIA evaluation system for autonomous driving described in detail in [31]. The system evaluates the performance of an already trained NVIDIA neural-network-based end-to-end learning platform for autonomous driving (PilotNet) on a test video using a simulator for autonomous driving. The system uses the following performance metrics for evaluation: autonomy, position precision, and comfort. We do not describe these metrics, as they are described well in the mentioned paper; we only emphasize that they are expressed as percentages, where 100% corresponds to the best performance. We collected high-resolution videos of the same road during the day and at night from a camera inside the car. Each video had ∼45 K frames. The pictures were resized to 512 × 512 resolution for the conversion and then resized back to the original size of 1920 × 1208. We used our domain translator as well as CycleGAN to convert the collected day video to a night video and the collected night video to a day video (Figure 9). To evaluate our model, we used the aforementioned NVIDIA evaluation system, where the converted videos were used as its testing sets. We report the results in Table 4.
The PilotNet model used for testing was trained mostly on day videos; thus, it is expected to perform worse on night videos. Therefore, the performance on the original night video is worse than on the same video converted to a day video in terms of autonomy and position precision. The comfort deteriorates due to the inconsistency of consecutive frames in the converted videos, i.e., the videos are converted frame by frame and we do not apply any post-processing to ensure smooth transitions between frames. The results for InvAuto and CycleGAN are comparable.

6. Conclusions

We proposed a novel architecture that we call the invertible autoencoder, which, as opposed to common deep learning architectures, allows the layers of the model that perform opposite operations (such as the encoder and decoder) to share weights. This is achieved by enforcing orthonormal mappings in the layers of the model. We demonstrated the applicability of the proposed architecture to the problem of domain adaptation and evaluated it on benchmark datasets and an autonomous driving task. The performance of the proposed approach matches state-of-the-art methods while requiring fewer trainable parameters.

Author Contributions

Y.T. is the lead author of this work. A.C. provided project supervision.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Invertible Autoencoder for Domain Adaptation

Figure A1. Comparison of the MSE of the diagonal of $DE - I$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets.
Figure A2. Comparison of the MSE of the off-diagonal of $DE - I$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets.
Figure A3. Comparison of the ratio of the MSE of the off-diagonal and diagonal of $DE$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets.
Table A1. Test reconstruction loss (MSE) for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets. VAE has significantly higher reconstruction loss by construction.
Dataset and Model | InvAuto | Auto | Cycle | VAE
MNIST MLP | 0.189 | 0.100 | 0.112 | 1.245
MNIST Conv | 0.168 | 0.051 | 0.057 | 1.412
CIFAR Conv | 0.236 | 0.126 | 0.195 | 1.457
CIFAR ResNet | 0.032 | 0.127 | 0.217 | 0.964
Figure A4. Heatmap of the values of the matrix $DE$ for InvAuto (a,e,i,m), Auto (b,f,j,n), Cycle (c,g,k,o), and VAE (d,h,l,p) on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets. Matrices $E$ and $D$ are constructed by multiplying the weight matrices of the consecutive layers of the encoder and decoder, respectively. In the case of InvAuto, $DE$ is the closest to the identity matrix.
  • Additional Experimental Results for Section 5
Figure A5. Day-to-night image conversion.
Figure A6. Night-to-day image conversion.
Figure A7. Day-to-thermal image conversion.
Figure A8. Thermal-to-day image conversion.
Figure A9. Maps-to-satellite image conversion.
Figure A10. Satellite-to-maps image conversion.
Figure A11. Experimental results with autonomous driving system: day-to-night conversion.
Figure A12. Experimental results with autonomous driving system: night-to-day conversion.
  • Invertible Autoencoder for Domain Adaptation: Architecture and Training
Generator architecture. Our implementation of InvAuto contains 18 invertible residual blocks for both 128 × 128 and 512 × 512 images, where 9 blocks are used in the encoder and the remaining 9 in the decoder. All layers in the decoder are the inverted versions of the encoder's layers. We furthermore add two down-sampling and two up-sampling layers for the model trained on 128 × 128 images, and three down-sampling and three up-sampling layers for the model trained on 512 × 512 images. The details of the generator's architecture are listed in Table A3 and Table A4. For convenience, we use Conv to denote a convolutional layer, ConvNormReLU to denote a Convolution-InstanceNorm-LeakyReLU layer, InvRes to denote an invertible residual block, and Tanh to denote the hyperbolic tangent activation function. The negative slope of the LeakyReLU function is set to 0.2. All filters are square and we use the following notation: K represents the filter size and F represents the number of output feature maps. Paddings are added correspondingly.
Discriminator architecture. We use a discriminator architecture similar to PatchGAN [1]; it is described in Table A2. We use this architecture for training on both 128 × 128 and 512 × 512 images.
Criterion and Optimization. At training, we set $\lambda = 10$ and use the $\ell_1$ loss for the cycle consistency term in Equation (12). We use the Adam optimizer [32] with learning rate $lr = 0.0002$, $\beta_1 = 0.5$, and $\beta_2 = 0.999$. We also add an $\ell_2$ penalty with weight $10^{-6}$.
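One possible optimizer setup matching these hyper-parameters is sketched below (illustrative code; the module names gen_A, gen_B, dis_A, dis_B and the grouping of parameters across optimizers are our assumptions, not taken from the paper).

```python
import itertools
import torch

# Illustrative sketch (assumed optimizer grouping, not the authors' code): Adam
# with lr = 0.0002, beta1 = 0.5, beta2 = 0.999; the l2 penalty of 1e-6 is
# realized as weight decay.
def build_optimizers(gen_A, gen_B, dis_A, dis_B):
    adam = lambda params: torch.optim.Adam(
        params, lr=2e-4, betas=(0.5, 0.999), weight_decay=1e-6)
    opt_gen = adam(itertools.chain(gen_A.parameters(), gen_B.parameters()))
    opt_dis_A = adam(dis_A.parameters())
    opt_dis_B = adam(dis_B.parameters())
    return opt_gen, opt_dis_A, opt_dis_B
```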
Table A2. Discriminator for both 128 × 128 and 512 × 512 images.
Name | Stride | Filter
ConvNormReLU | 2 × 2 | K4-F64
ConvNormReLU | 2 × 2 | K4-F128
ConvNormReLU | 2 × 2 | K4-F256
ConvNormReLU | 1 × 1 | K4-F512
Conv | 1 × 1 | K4-F1
Table A3. Generator for 128 × 128 images.
Name | Stride | Filter
ConvNormReLU | 1 × 1 | K7-F64
ConvNormReLU | 2 × 2 | K3-F128
ConvNormReLU | 2 × 2 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
InvRes | 1 × 1 | K3-F256
ConvNormReLU | 1/2 × 1/2 | K3-F128
ConvNormReLU | 1/2 × 1/2 | K3-F64
Conv | 1 × 1 | K7-F3
Tanh
Table A4. Generator for 512 × 512 images.
Name | Stride | Filter
ConvNormReLU | 1 × 1 | K7-F64
ConvNormReLU | 2 × 2 | K3-F128
ConvNormReLU | 2 × 2 | K3-F256
ConvNormReLU | 2 × 2 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
InvRes | 1 × 1 | K3-F512
ConvNormReLU | 1/2 × 1/2 | K3-F256
ConvNormReLU | 1/2 × 1/2 | K3-F128
ConvNormReLU | 1/2 × 1/2 | K3-F64
Conv | 1 × 1 | K7-F3
Tanh

References

  1. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  2. Wang, T.; Liu, M.; Zhu, J.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
  3. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017.
  4. Liu, M.; Breuel, T.; Kautz, J. Unsupervised Image-to-Image Translation Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  5. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
  6. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
  7. Nguyen, A.; Yosinski, J.; Bengio, Y.; Dosovitskiy, A.; Clune, J. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  8. Gan, Z.; Chen, L.; Wang, W.; Pu, Y.; Zhang, Y.; Liu, H.; Li, C.; Carin, L. Triangle Generative Adversarial Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  9. Zhang, H.; Xu, T.; Li, H. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017.
  10. Wang, C.; Wang, C.; Xu, C.; Tao, D. Tag Disentangled Generative Adversarial Network for Object Image Re-rendering. In Proceedings of the 2017 International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
  11. Wang, W.; Huang, Q.; You, S.; Yang, C.; Neumann, U. Shape Inpainting Using 3D Generative Adversarial Network and Recurrent Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017.
  12. Vondrick, C.; Pirsiavash, H.; Torralba, A. Generating Videos with Scene Dynamics. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
  13. Finn, C.; Goodfellow, I.; Levine, S. Unsupervised Learning for Physical Interaction Through Video Prediction. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
  14. Vondrick, C.; Torralba, A. Generating the Future with Adversarial Transformers. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  15. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative Adversarial Text to Image Synthesis. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016.
  16. Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499.
  17. Lu, J.; Kannan, A.; Yang, J.; Parikh, D.; Batra, D. Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  18. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, 14–16 April 2014.
  19. Dong, H.; Neekhara, P.; Wu, C.; Guo, Y. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. arXiv 2017, arXiv:1701.02676.
  20. Taigman, Y.; Polyak, A.; Wolf, L. Unsupervised Cross-Domain Image Generation. arXiv 2016, arXiv:1611.02200.
  21. Liu, M.Y.; Tuzel, O. Coupled Generative Adversarial Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
  22. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), Helsinki, Finland, 5–9 July 2008.
  23. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image Style Transfer Using Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  24. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
  25. Ulyanov, D.; Lebedev, V.; Vedaldi, A.; Lempitsky, V.S. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016.
  26. Gatys, L.A.; Bethge, M.; Hertzmann, A.; Shechtman, E. Preserving Color in Neural Artistic Style Transfer. arXiv 2016, arXiv:1606.05897.
  27. McCann, M.T.; Jin, K.H.; Unser, M. Convolutional Neural Networks for Inverse Problems in Imaging: A Review. IEEE Signal Process. Mag. 2017, 34, 85–95.
  28. Vasudevan, A.; Anderson, A.; Gregg, D. Parallel Multi Channel convolution using General Matrix Multiplication. In Proceedings of the 28th Annual IEEE International Conference on Application-specific Systems, Architectures and Processors, Seattle, WA, USA, 10–12 July 2017.
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  30. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–10 June 2015.
  31. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to End Learning for Self-Driving Cars. arXiv 2016, arXiv:1604.07316.
  32. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
Figure 1. Heatmap of the values of the matrix $DE$ for InvAuto (a,e), Auto (b,f), Cycle (c,g), and VAE (d,h) on MLP and ResNet architectures and MNIST and CIFAR datasets. Matrices $E$ and $D$ are constructed by multiplying the weight matrices of the consecutive layers of the multi-layer encoder and decoder, respectively, e.g., $E = E_L \cdots E_2 E_1$ and $D = D_L \cdots D_2 D_1$ for a $2L$-layer autoencoder. In the case of InvAuto, $DE$ is the closest to the identity matrix.
Figure 2. Comparison of the mean squared error (MSE) $\mathrm{MSE}(DE - I)$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional, and ResNet architectures and MNIST and CIFAR datasets. Matrices $E$ and $D$ are constructed by multiplying the weight matrices of the consecutive layers of the encoder and decoder, respectively.
Figure 3. (a) Residual block. (b) Inverted residual block.
Figure 4. The histograms of cosine similarity of the rows of $E$ for InvAuto (a,e), Auto (b,f), Cycle (c,g), and VAE (d,h) on MLP and ResNet architectures and MNIST and CIFAR datasets.
Figure 5. The architecture of the domain translator with InvAuto $(E, D)$. $x_A \in \mathcal{A}$ and $x_B \in \mathcal{B}$ are the inputs of the translator. $y_B$ is the image $x_A$ converted into the $\mathcal{B}$ domain and $y_A$ is the image $x_B$ converted into the $\mathcal{A}$ domain. The invertible autoencoder $(E, D)$ is built of encoder $E$ and decoder $D$, where each of those is itself an autoencoder. $\mathrm{Enc}_A$, $\mathrm{Enc}_B$ are feature extractors, and $\mathrm{Dec}_A$, $\mathrm{Dec}_B$ are the final layers of the generators $\mathrm{Gen}_B$, i.e., $(\mathrm{Enc}_A, E, \mathrm{Dec}_B)$, and $\mathrm{Gen}_A$, i.e., $(\mathrm{Enc}_B, D, \mathrm{Dec}_A)$, respectively. Discriminators $\mathrm{Dis}_A$ and $\mathrm{Dis}_B$ discriminate whether their input comes from the generator (True) or the original dataset (False).
Figure 6. (Left) Day-to-night image conversion. Zoomed image is shown in Figure A5 in Appendix A. (Right) Night-to-day image conversion. Zoomed image is shown in Figure A6 in Appendix A.
Figure 7. (Left) Day-to-thermal image conversion. Zoomed image is shown in Figure A7 in Appendix A. (Right) Thermal-to-day image conversion. Zoomed image is shown in Figure A8 in Appendix A.
Figure 8. (Left) Maps-to-satellite image conversion. Zoomed image is shown in Figure A9 in Appendix A. (Right) Satellite-to-maps image conversion. Zoomed image is shown in Figure A10 in Appendix A.
Figure 9. (Left) Experimental results with autonomous driving system: day-to-night conversion. Zoomed image is shown in Figure A11 in Appendix A. (Right) Experimental results with autonomous driving system: night-to-day conversion. Zoomed image is shown in Figure A12 in Appendix A.
Table 1. Mean and standard deviation of the cosine similarity of the rows of $E$. InvAuto achieves cosine similarity that is the closest to 0. The best performer is in bold.
Dataset and Model | InvAuto | Auto | Cycle | VAE
MNIST MLP | 0.001 ± 0.118 | 0.008 ± 0.210 | 0.007 ± 0.207 | 0.001 ± 0.219
MNIST Conv | 0.001 ± 0.148 | 0.001 ± 0.179 | 0.001 ± 0.176 | 0.001 ± 0.190
CIFAR Conv | 0.001 ± 0.145 | 0.002 ± 0.176 | 0.004 ± 0.195 | 0.003 ± 0.268
CIFAR ResNet | 0.000 ± 0.134 | 0.000 ± 0.203 | 0.000 ± 0.232 | 0.001 ± 0.298
Table 2. Mean and standard deviation of the $\ell_2$-norm of the rows of $E$. InvAuto achieves row norms that are the closest to the unit norm. The best performer is in bold.
Dataset and Model | InvAuto | Auto | Cycle | VAE
MNIST MLP | 0.976 ± 0.190 | 1.326 ± 0.095 | 1.268 ± 0.095 | 1.832 ± 0.501
MNIST Conv | 0.905 ± 0.321 | 1.699 ± 0.732 | 1.780 ± 0.779 | 1.971 ± 0.794
CIFAR Conv | 0.908 ± 0.219 | 3.027 ± 0.816 | 2.463 ± 0.688 | 1.176 ± 0.356
CIFAR ResNet | 0.868 ± 0.078 | 2.890 ± 0.895 | 2.650 ± 0.937 | 1.728 ± 0.311
Table 3. Numerical evaluation of CycleGAN, UNIT, and InvAuto with the $\ell_1$ reconstruction loss.
Tasks | CycleGAN | UNIT | InvAuto
Night-to-day | 0.033 | 0.227 | 0.062
Day-to-night | 0.041 | 0.114 | 0.067
Thermal-to-day | 0.287 | 0.339 | 0.299
Day-to-thermal | 0.179 | 0.194 | 0.205
Maps-to-satellite | 0.261 | 0.331 | 0.272
Satellite-to-maps | 0.069 | 0.104 | 0.080
Table 4. Experimental results with autonomous driving system: autonomy, position precision, and comfort.
Video Type | Autonomy | Position Precision | Comfort
Original day | 99.6% | 73.3% | 89.7%
Original night | 98.6% | 63.1% | 86.3%
Day-to-night (InvAuto) | 99.0% | 69.6% | 83.2%
Night-to-day (InvAuto) | 99.3% | 68.0% | 84.7%
Day-to-night (CycleGAN) | 99.0% | 68.4% | 84.7%
Night-to-day (CycleGAN) | 98.8% | 64.0% | 87.3%
