Article

Impact of Sentence Representation Matching in Neural Machine Translation

by Heeseung Jung, Kangil Kim, Jong-Hun Shin, Seung-Hoon Na, Sangkeun Jung and Sangmin Woo

1. Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Korea
2. Electronics and Telecommunications Research Institute (ETRI), Gwangju 61012, Korea
3. Department of Computer Science, Jeonbuk National University, Jeonju-si 54896, Korea
4. Computer Science and Engineering, Chungnam National University, Daejeon 34134, Korea
5. Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea
* Authors to whom correspondence should be addressed.
Submission received: 6 December 2021 / Revised: 6 January 2022 / Accepted: 19 January 2022 / Published: 26 January 2022

Abstract

Most neural machine translation (NMT) models are implemented as a conditional language model framework composed of encoder and decoder models. This framework learns complex and long-distance dependencies, but its deep structure causes inefficiency in training. Matching vector representations of source and target sentences mitigates this inefficiency by shortening the path from parameters to costs, and it generalizes NMT from a perspective different from the cross-entropy loss. In this paper, we propose matching methods that derive a cost from constant word-embedding vectors of the source and target sentences. To find the best method, we analyze the impact of the methods with varying structures, distance metrics, and model capacities in a French-to-English translation task. The optimally configured method is then applied to English translation tasks from and to French, Spanish, and German. In these tasks, the method improved performance by up to 3.23 BLEU, with an average improvement of 0.71. We also evaluated the robustness of the method to various embedding distributions and models, such as conventional gated structures and transformer networks, and the empirical results show that it has a high chance of improving performance in those models.

1. Introduction

Most decoders of neural machine translation (NMT) are conditional language models, which sequentially generate target words conditioned on a given source sentence. This approach is greedy, so the dependency between sequentially selected target words may restrict the selection of the best composition of target words. Beam search is a promising method to approximate the correct compositions; conversely, however, its effectiveness implies that NMT models are still weak at learning the dependency between output words within the model. This limitation is a fundamental barrier to training an end-to-end NMT model that does not rely on an additional model or algorithm at inference time, as proposed in various approaches [1,2]. An effective method to relax this limit is to penalize the cross-entropy by adding a sentence-level score [2,3,4] computed from information generated by the decoder. Beyond the improvement of prediction quality in these works, mapping between two sentence representations has a more fundamental meaning, because it can provide useful information about training difficulties that remain unseen in typical NMT.
In this paper, we propose a sentence representation matching method that applies direct mapping of sentence-level semantics to existing NMT frameworks. The method is designed to guide the training of a model with constant word vectors. To obtain a more optimized structure and a deeper understanding of the guiding pressure generated by the method, we analyze its impact within the framework while varying structures, distance metrics, and model capacity determined by layer dimensions. The best model derived from this analysis is applied to practical English translation tasks from and to French, Spanish, and German. We then analyze its robustness to various embedding distributions and model structures, such as long short-term memory recurrent neural networks and transformer networks.
The contributions of this paper are that it:
  • proposes a sentence representation matching method and loss functions for training;
  • provides an impact analysis of the structural factors that control the guiding pressure;
  • provides a well-performing setting across many models, embeddings, and translation tasks.
Section 2 reviews related and background work. Section 3 explains the motivation and details of the proposed method and introduces the factors that affect the guiding pressure, which are used to test variants of the method. Section 4 describes the experimental settings of the data and models used for the analysis. Section 5 presents the results and discussion. Section 6 and Section 7 present the conclusion and future work.

2. Background

Since 2014, there has been a surge in applying recurrent neural networks with long short-term memory (LSTM) [5] to machine translation, after the possibility of learning an end-to-end translation model was reported [6]. After intensive development over more than two years, this approach became the state of the art in machine translation and was called NMT. The remarkable research that improved the performance of NMT includes bidirectional LSTMs using both forward and backward sequences [6,7,8], attention models that learn explicit alignments [9,10], rare-word modeling to estimate unknown words with an explicit model and alignment model [11], and augmentation methods to overcome a lack of data [12,13].
Those works have been made more rigorous by adopting many advanced methods such as batch normalization [14], ensemble, beam search, input feature specialization, and input feeding. Those techniques are aggregated in Google’s NMT report [15].
In 2017, the transformer, a feed-forward model with a self-attention mechanism, was proposed and showed remarkable performance surpassing LSTM [16]. More recently, very deep transformer models have demonstrated higher performance than the vanilla transformer [17], and pre-trained models have also demonstrated remarkable performance [18,19,20].
Direct mapping of semantic vectors between two languages has been studied in various research directions, but the vector distribution in translation tasks is complex, so simple yet successful direct mapping models have not been proposed so far. Recent complex NMT frameworks allow the model-based approach to manage most of the complex relations between words. However, the direct mapping approach is still useful for understanding the macroscopic similarity between two concepts. In word-level mapping models [21,22,23], the similarity between semantic vectors has been used as a dictionary to determine the most semantically related words, even though the mapping is still ambiguous for words used for various purposes. Direct mapping models can be understood as extremely regularized models for translation compared to current NMT frameworks, which are particularly difficult to regularize because of the high sensitivity of the parameters in the recurrent networks [24]. In this paper, we propose a safe method to use this macroscopic information to guide current NMT frameworks.
A pointer generator network is a model that uses sentence-level semantic matching [25]; it refines the probability vector for selecting a class using information from the distributed representation. This network has shown promising improvement, but it is designed only for enhancing performance rather than for guiding the internal hypotheses of NMT toward a more natural translation model. In this paper, we focus on injecting the information of the direct mapping model to both regularize and guide the NMT.

3. Sentence Representation Matching

3.1. Motivation

Beam search is a promising method for NMTs because it overcomes the problem of greedy search in sequential target word generation. On the other hand, the impact of beam search inversely implies that the sequential decisions made by the model alone often fail to select the best sentences. There are many possible causes for the inaccurate prediction of the composition of target words, such as inaccurate model representation, complex parameter landscapes, and noisy data.
One possible cause is the simple representation of the correctness of target words. In current NMTs, cross-entropy is the most popular cost function and is composed of the probabilities of selecting each correct word of a target sequence. Therefore, only one variable is responsible for representing whether a selected target word is correct. Using only one variable may be risky, because the second most probable word and a highly probable sequence following it may be scored better than any sequence derived from the correct word selection. This is a deceptive case that restricts accurate word composition in decoders.
Another cause is the slow parameter update in NMT structures. In LSTMs, the gradient vanishing [26] that occurs over time steps and over stacks in the vertical direction of the structure is resolved by memory cells and by input feeding or multidimensional memories [15,27], respectively. These are applied to the encoder and decoder, but the interface between them is often a feed-forward layer that still suffers from gradient vanishing. This vanishing limits the achievable translation quality in general and may restrict the learning of the correct composition.
Using a different type of cost function is an effective way to address this problem. In our preliminary work [4], we evaluated a matching method on an English to French translation task with 1.5 million sentences and confirmed its improvement. We extend that work here with a more rigorous analysis of how to control the hyper-parameters with respect to their impact on training.

3.2. Methods

The sentence representation matching proposed in this paper is composed of two ideas. First, the method uses the output vector generated by the encoder model and passes it to a matching layer to derive the cost function. This idea reduces the distance from the cost function to the encoder model, so it can reduce the potential negative effects of gradient vanishing. Second, the cost is directly connected to the target word vectors, which is expected to provide an effective guide for the training of the encoder model.
To implement the two ideas as a single neural network added to existing NMTs, we introduce sentence representation matching, where the sentence representation is a concept conveying the semantics of a source or target sentence. An expected role of this approach is to guide NMTs away from training obviously wrong sentences, as identified by an explicit direct mapping model.
The method illustrated in Figure 1 is the simplest form, which will be extended in the following sections.
Compared to existing NMT frameworks, we newly propose the following three parts.

3.2.1. Vector Representation of Sentences

The proposed method matches two vector representations to evaluate the semantics of the source and target sentences. The representation of the source sentence is generated from the output vectors of the encoder model to induce a guiding effect on the model. The representation of the target sentence uses fixed word-vector sequences. The common assumption in building the representations is to use fixed word vectors from an imported external dictionary or a one-hot representation.
The method used in this paper is defined by the following equations, where the representation $r_S$ of a source sentence $S$ and the representation $r_T$ of a target sentence $T$ are defined as
$$r_S = \sum_{t=1}^{|S|} h_t \qquad (1)$$
$$r_T = \sum_{t=1}^{|T|} w_t \qquad (2)$$
where $h_t$ is the hidden vector generated from the top LSTM stack at time step $t$ in an NMT encoder and $w_t$ is the word vector at time $t$ in the decoder. This representation extraction is not necessarily the addition of word or output vectors over time, and it can easily be extended to general encoder-decoder models. For example, in bidirectional models, $h_t$ is replaced by the concatenation $[h_t^f; h_t^b]$. In bidirectional attention models, the interface vectors are transformed from $h_t$ with the alignment model and target word, but we can still use $[h_t^f; h_t^b]$.
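As a concrete illustration of Equations (1) and (2), the following PyTorch sketch sums masked encoder outputs and frozen target word embeddings; the tensor shapes and function names are our assumptions, not the authors' released code.

```python
import torch

def source_representation(enc_outputs: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
    """r_S of Equation (1): sum of top-layer encoder hidden states h_t over time.
    enc_outputs: (batch, src_len, hidden); src_mask: (batch, src_len), 1.0 for real tokens."""
    return (enc_outputs * src_mask.unsqueeze(-1)).sum(dim=1)

def target_representation(tgt_ids: torch.Tensor, fixed_emb: torch.nn.Embedding,
                          tgt_mask: torch.Tensor) -> torch.Tensor:
    """r_T of Equation (2): sum of constant (frozen) target word embeddings w_t over time."""
    with torch.no_grad():                       # the word vectors stay fixed during training
        w = fixed_emb(tgt_ids)                  # (batch, tgt_len, emb_dim)
    return (w * tgt_mask.unsqueeze(-1)).sum(dim=1)
```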

3.2.2. Matching Layer

In matching representations, the biggest risk is a conflict between the gradients of the cross-entropy and of the distance when changing the vector distribution of $h_t$, which generates bad local optima. If the scale of the distance is dominated by the cross-entropy, the distribution of optima will be similar to that of the original NMT; otherwise, this negative phenomenon occurs. To reduce the negative effect of the conflict, we added an additional linear combination layer for more flexible mapping, which can be extended to more general neural network layers.
$$v_S = W_r r_S + b_r \qquad (3)$$
$$v_T = r_T \qquad (4)$$
The matching layer is composed of the parameters $W_r$ and $b_r$ for the representation matching. The generated source-side representation $v_S$ is mapped to the target-side representation $v_T$.
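A minimal PyTorch sketch of this matching layer follows, under the assumption that it is the 1-layer source-side variant of Equations (3) and (4); the 2-layer variant of Table 1 would insert a sigmoid hidden layer.

```python
import torch.nn as nn

class MatchingLayer(nn.Module):
    """Source-side linear matching layer: v_S = W_r r_S + b_r, v_T = r_T."""
    def __init__(self, enc_dim: int, emb_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, emb_dim)   # W_r and b_r

    def forward(self, r_s, r_t):
        v_s = self.proj(r_s)   # map the encoder-side representation into the embedding space
        v_t = r_t              # the target side is left untouched in this variant
        return v_s, v_t
```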

3.2.3. Cost Function

In sentence-level translation using representation matching, the underlying assumption is that the semantics of a source sentence and its translation are represented as the same vector in a space for representing general semantics. Under this assumption, reducing the distance between the representations pursues exactly the same goal as the cross-entropy in NMTs. For this reason, we believe that the cost function will not generate serious side effects in training but will induce positive effects, such as providing a training guide and regularization, by providing information about the correct translation from a different perspective. To use this information, we set the cost function as the following equation, given model parameters $\theta$ and training set $D$.
$$L_{total} = L_{cost} + L_{distance}(v_S, v_T) \qquad (5)$$
This method adds the cost and the distance without any scaling factor because the matching layer implicitly adapts its scale during the updates. In the early stages of the updates, random initialization makes the layer give a large distance for all vectors, so the distance loss dominates the updates and makes the NMT rapidly converge to a model that generates small distances for all vectors. Then, the impact of the cross-entropy increases, and the model moves toward the true optimum determined by the entropy. Therefore, if the optimal distance is sufficiently small, this method guides the training in the early updates and preserves the true optima with respect to the cross-entropy, while restricting the generation of negative sentences.
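The sketch below shows one way Equation (5) could be computed in PyTorch, adding a matching distance to the token-level cross-entropy with no scaling factor; the argument names and the distance_fn hook are our assumptions.

```python
import torch.nn.functional as F

def total_loss(logits, gold_ids, v_s, v_t, distance_fn, pad_id: int):
    """L_total = L_cost + L_distance(v_S, v_T) with no scaling factor (Equation (5))."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         gold_ids.reshape(-1), ignore_index=pad_id)
    return ce + distance_fn(v_s, v_t).mean()   # distance_fn: one of the metrics in Section 3.3.2
```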

3.3. Guiding Pressure

The cost newly introduced by matching sentence representations generates gradients different from those of the cross-entropy, pushing the model in other directions. We call these gradients the guiding pressure in this paper. This pressure is affected by many factors of the matching method, because the gradients may disappear before changing the parameters of the encoder model due to deep layers, the large expressive power of the matching layer, or loss propagated to other layers. To understand the impact of the guiding pressure and to find the optimal configuration, we investigated its potential factors and prepared possible ways to control its strength.

3.3.1. Structure of Matching Layers

The depth of the matching layers is a probable factor affecting the strength of the guidance. If the layers are deep, they cause the gradient vanishing problem, which weakens the impact of the guidance because the layers form a feed-forward neural network. If the layers are too shallow, the expressive power of the layers is restricted, and therefore the strength of the guidance may not increase sufficiently.
In addition to the depth, the side responsible for the final cost calculation is also an attribute affecting the strength, because the gradients that reduce the cost are contributed equally by the source and target sides. To analyze the impact of this variation, we tested five structures, as shown in Figure 2 and Table 1.

3.3.2. Similarity Metrics

The metric used to evaluate the similarity between sentence representations also strongly affects the guiding pressure, because it determines which points are located at the same distance from a representation, as well as the gradients that move a matching representation without distance loss. For the distance term in (5), we selected three widely used basic distance metrics, the normalized Hamming, Euclidean, and cosine distances, whose equidistant regions are a rotated n-dimensional cube, an n-dimensional sphere, and a line, respectively. They are listed in Table 2 and defined with $v_S$ and $v_T$ of Equations (3) and (4) as follows.
$$\text{normalized Hamming distance:}\ \frac{\lVert v_S - v_T \rVert}{n}$$
$$\text{Euclidean distance:}\ \lVert v_S - v_T \rVert_2$$
$$\text{cosine similarity:}\ \frac{v_S \cdot v_T}{\lVert v_S \rVert\,\lVert v_T \rVert}$$
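A hedged PyTorch sketch of these three metrics follows; we read the normalized Hamming distance as the element-wise absolute difference summed and divided by the dimension n, and we turn the cosine similarity into a loss as 1 minus the similarity, both of which are our interpretations.

```python
import torch
import torch.nn.functional as F

def matching_distance(v_s: torch.Tensor, v_t: torch.Tensor, metric: str) -> torch.Tensor:
    """Candidate distance terms for Equation (5), batched over (batch, dim) tensors."""
    if metric == "hamming":            # normalized Hamming-style distance, read as L1 / n
        return (v_s - v_t).abs().sum(dim=-1) / v_s.size(-1)
    if metric == "euclidean":
        return (v_s - v_t).norm(p=2, dim=-1)
    if metric == "cosine":             # converted to a loss so that smaller is better
        return 1.0 - F.cosine_similarity(v_s, v_t, dim=-1)
    raise ValueError(f"unknown metric: {metric}")
```

A fixed metric can be passed as the distance_fn of the earlier loss sketch, for example functools.partial(matching_distance, metric="hamming").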

3.3.3. Model Capacity

A clear attribute deciding the guidance strength is model capacity, also understood as model complexity. If the model complexity is low, the model can be too generalized to represent the complex relation needed for matching representations. Otherwise, the model easily finds the complex relation without propagating the pressure to the encoder model. The optimal model capacity is determined by the involved sentence representations, which vary with the given data, so an empirical investigation is required. From preliminary results, we found that the two-layer source-side structure allows the model complexity to be varied flexibly while guiding effectively, so we investigated various model capacities by changing the number of hidden nodes of the matching layer from an extremely low dimension to a sufficiently large dimension, without regularizing the network at all. Although the capacity depends on the source and target dictionary sizes, the added parameters and the computational complexity of training them occupy only a tiny portion of the whole model.

4. Experiment Setting

In our experiments, we aim to investigate the impact of representation matching under various parameter conditions because of its complex relation to regularization, which is caused by many factors. We then evaluate its performance in translating English from and to French, Spanish, and German.

4.1. Settings for Impact Analysis

To build a training set, we merged the Europarl parallel corpus and the Common Crawl corpus released for WMT-14 (http://www.statmt.org/wmt14/, accessed on 18 January 2022). We applied tokenization and lowercasing and limited the length of each sentence to a maximum of 40 tokens using scripts provided by the machine translation package MOSES [28] (http://www.statmt.org/moses/, accessed on 18 January 2022). A start and an end symbol are attached to each source sentence. We used the news-commentary-v8 and newstest-2013 sets released with the training set as test sets. Data statistics are shown in Table 3. For validation, we randomly selected 10% of the sentences of the training set.
We extracted word vectors from the training set using a language model implemented in word2vec [29] (https://code.google.com/archive/p/word2vec/, accessed on 18 January 2022) for all language pairs. Each dictionary is composed of the 40,000 most frequently observed tokens and their vector representations. The same dictionary is imported in the training, validation, and test phases of a language pair.
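As a rough illustration of this dictionary construction (the paper used the original word2vec tool linked above; the gensim-based code and parameter choices below are our assumptions), one could build 50-dimensional vectors for roughly the 40,000 most frequent tokens as follows.

```python
# Hypothetical sketch of building a fixed embedding dictionary with gensim's word2vec.
from gensim.models import Word2Vec

def build_dictionary(corpus_path: str, dim: int = 50, max_tokens: int = 40000):
    sentences = [line.split() for line in open(corpus_path, encoding="utf-8")]
    model = Word2Vec(sentences, vector_size=dim, max_final_vocab=max_tokens)
    # keep the surviving tokens and their vectors; these stay constant during NMT training
    return {tok: model.wv[tok] for tok in model.wv.index_to_key}
```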
We built a bidirectional model and passed $h$ and $c$ from the forward to the backward pass of the encoder. Then, $h_t$ of the forward pass and $h_{|S|-t}$ of the backward pass are concatenated to derive $r_S$. The attention model is the same as in [9], except that $c$ is additionally passed for the initialization of the decoder. The input word vectors are fed into the second-shallowest LSTM stack of the encoder. To boost convergence speed, we applied batch normalization through a weighted average of the original and normalized vectors. The weight is decayed by multiplying it by 0.8 at each epoch, so it becomes almost 0 after 16 epochs. The representation matching is applied only in the training phase. Details of the model settings are shown in Table 4.
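The decayed batch-normalization blend could look like the sketch below; we assume the decayed weight is the one on the normalized vector, so the normalization effect fades out over roughly 16 epochs, and the module and method names are hypothetical.

```python
import torch.nn as nn

class DecayedBatchNorm(nn.Module):
    """Weighted average of the original and batch-normalized vectors; the weight on the
    normalized part is multiplied by 0.8 once per epoch."""
    def __init__(self, dim: int, decay: float = 0.8):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.weight = 1.0
        self.decay = decay

    def forward(self, x):                       # x: (batch, dim)
        return (1.0 - self.weight) * x + self.weight * self.bn(x)

    def step_epoch(self):                       # call once at the end of every epoch
        self.weight *= self.decay
```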
For the impact analysis of the guiding pressure, we extended this base structure with respect to the structure, model capacity, and distance metrics in English-French translation, as shown in Table 2.

4.2. Setting for Robustness Analysis

In the experiments investigating robustness to the model structure and embedding distribution, we reproduced an open-source NMT program (https://github.com/OpenNMT/OpenNMT-py, accessed on 18 January 2022) [30] and evaluated the performance of LSTM and the transformer with random, word2vec, and Bidirectional Encoder Representations from Transformers (BERT) [31] embeddings on the French to English translation task. We used the re-sized Europarl corpus for training (https://www.statmt.org/europarl/v7/fr-en.tgz, accessed on 18 January 2022), common-test for validation (http://www.statmt.org/europarl/v1/common-test2.tgz, accessed on 18 January 2022), and newstest (http://data.statmt.org/wmt17/translation-task/test.tgz, accessed on 18 January 2022) and news-commentary (http://matrix.statmt.org/test_sets/nc-test2007.tgz, accessed on 18 January 2022) for tests. The applied preprocessing methods are the same as before. The number of tokens is 50,000. Special symbols, including the start, end, unknown, and blank symbols, remain in the corpus. Data statistics for this robustness analysis are shown in Table 5.
The tested embedding methods are random generation (RE), importing embedded vectors trained by an open-source word2vec implementation (https://github.com/Andras7/word2vec-pytorch, accessed on 18 January 2022) (WV), and importing embedded vectors of pre-trained multilingual BERT (BE). RE selects each element of an embedding vector from a uniform distribution in [−0.1, 0.1], as shown in Table 6. WV is generated as in the other experiments by learning a language model from the parallel training corpus with the same dictionary. BE is extracted from a pre-trained multilingual BERT [31] model (https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip, accessed on 18 January 2022) consisting of an embedder and a number of encoders. RE, WV, and BE are kept fixed in every experiment.
To compare the effects on widely used NMT architectures, we selected the LSTM [10] with a unidirectional encoder and the transformer [16]. In the transformer, the matching network receives the final output vector of the encoder-transformer as its input and calculates its distance to the sum of the word-embedding vectors of the decoder-transformer. Detailed model parameters are shown in Table 7. The setting of the matching layer is the same as in the other experiments: source-side 2 layers, 250 hidden dimensions, and normalized Hamming distance. A sketch of how the matching attaches to the transformer is given below.
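The following hypothetical PyTorch fragment (not the OpenNMT-py code) shows the attachment described above: the final encoder-transformer outputs are summed for the source representation, the frozen decoder word embeddings are summed for the target representation, and the matching distance is added to the usual loss.

```python
import torch

def transformer_matching_loss(enc_out, src_mask, tgt_ids, tgt_mask,
                              dec_embedding, matcher, distance_fn):
    """enc_out: (batch, src_len, d_model) final encoder-transformer outputs;
    dec_embedding: the decoder's frozen token-embedding table."""
    r_s = (enc_out * src_mask.unsqueeze(-1)).sum(dim=1)          # sum over source positions
    with torch.no_grad():
        r_t = (dec_embedding(tgt_ids) * tgt_mask.unsqueeze(-1)).sum(dim=1)
    v_s, v_t = matcher(r_s, r_t)                                 # e.g. the MatchingLayer sketched earlier
    return distance_fn(v_s, v_t).mean()                          # added to the cross-entropy loss
```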

5. Results and Discussions

To evaluate the impact on performance, we investigated token-level precision and BLEU scores [32]. In addition to these scores, we evaluated layer-wise statistics of the neural network to understand the impact on model complexity, according to the following definitions:
$$\mu_L = \frac{1}{|L|} \sum_{l \in L} \frac{\lVert W^{(l)}_{l_{in} \times l_{out}} \rVert}{l_{in}\, l_{out}}$$
$$\sigma_L = \frac{1}{|L|} \sum_{l \in L} \sqrt{\frac{\lVert W^{(l)}_{l_{in} \times l_{out}} \rVert_2^2}{l_{in}\, l_{out}} - \mu_l^2}$$
where $L$ is a set of neural network layers and $W^{(l)}_{l_{in} \times l_{out}}$ is the weight parameter matrix of layer $l$ with $l_{in}$ and $l_{out}$ dimensions. The metrics $\mu_L$ and $\sigma_L$ are the layer-wise mean and standard deviation of the network, and $\mu_l$ is the mean for the specific layer $l$. These two metrics are expected to show the degree of dispersion of the hyperplanes represented by a layer in Cartesian and polar coordinates. Compared to the gradient, the change of these metrics is more focused on evaluating model complexity at each update step rather than the shape of the model's movement during optimization.
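A small PyTorch sketch of these statistics is shown below; reading them as the mean and standard deviation of the absolute weight values of each layer, averaged over a chosen set of weight matrices, is our interpretation.

```python
import torch

def layerwise_stats(weight_matrices):
    """mu_L and sigma_L over a list of 2-D weight tensors, e.g.
    [p for _, p in model.named_parameters() if p.dim() == 2]."""
    mus, sigmas = [], []
    for W in weight_matrices:
        a = W.detach().abs()
        mu_l = a.mean()                                   # mean absolute weight of layer l
        var_l = (a.pow(2).mean() - mu_l.pow(2)).clamp(min=0.0)
        mus.append(mu_l)
        sigmas.append(var_l.sqrt())                       # standard deviation of |W| for layer l
    return torch.stack(mus).mean().item(), torch.stack(sigmas).mean().item()
```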

5.1. Impact Analysis

5.1.1. Structure

Figure 3 shows the token-level precision results for each structure. One-layer structures appear as a single point in the graph because the dimension for the distance calculation is fixed. Two-layer structures are evaluated over varying hidden-layer dimensions. Using 1 or 2 layers showed no significant difference for the source- and target-side structures. The both-sides structure showed lower precision than the source-side structure.
Figure 4 shows the layer-wise mean and standard deviation of the absolute weight parameters of the encoder and decoder models during training. For the mean, the difference between structures is small because of the very large number of parameters; thus, a small change may still be induced by significantly large changes in some parameters. In the zoomed boxes of subgraph (a), the mean values of the source-side layers were higher than those of the no-matching case and the target-side layers in the early epochs, although they were lower after sufficient training. In subgraph (b), source-side layers maintained the mean value of the decoder models, but target-side layers showed significantly larger values. Subgraphs (c) and (d) show a similar superiority of the source-side layers to the mean cases.
Overall, source-side layers provided higher pressure at the early stage but caused less conflict at the final stage of training compared to target-side layers. In the early stage, the gradient of the matching cost is split between the encoder and the matching layer for target-side layers, while source-side layers use the full gradient for training the encoder and thus increase training speed. In the final stage, the matching mechanism restricts the training of the correct output vector distribution of the encoder, so the decoder models need to learn more information and therefore increase in model complexity. In this stage, source-side layers are more flexible regarding changes of the output distribution, because a neural network layer is a many-to-one mapping between input and output. Any movement of the output distribution is regarded as a change of the input in source-side layers, but it pushes the output vectors of the matching layer in target-side layers. Thus, target-side layers are more reluctant to change and restrict the encoder from moving toward the correct distribution for minimizing the translation cost. In sum,
  • Source-side layers have a higher impact on improving performance;
  • Source-side layers have stronger pressure and less conflict.

5.1.2. Similarity Metrics

We evaluated the impact of the similarity metrics in the source-side two-layer case, as shown in Figure 5.
In the results of Figure 5, the Hamming distance was slightly better than the cosine distance, and the Euclidean distance performed much worse than the others. In Figure 6, the Hamming distance showed a larger mean and STD than the cosine distance in all models and epochs. The Euclidean distance is excluded because its extremely large values do not fit the scale used to show the difference between the other distances. In these results, the impact of the Euclidean distance on the pressure is the largest, but it seems to be stronger than the required pressure, as it restricts the performance. The Hamming and cosine distances are more stable, generating relatively weak pressure that preserves the model obtained by cross-entropy. The Hamming distance has stronger pressure than the cosine distance.

5.1.3. Model Capacity

The impact of model capacity is shown in Figure 3, which includes the precision results of the source-side, target-side, and both-sides 2-layer structures while changing the dimension of the hidden layers.
Both-sides layers showed little change as the dimension increased. The precision of the source-side layers slightly decreased, while that of the target-side layers increased. In Figure 7, detailed results on the degree of dispersion are shown for all distance types with various dimensions. Overall, no simple correlation between the hidden dimensions and the degree of dispersion is found.

5.1.4. Performance of Best Matching Layer

For the French to English task used in the impact analysis, the grid search results are shown in Figure 8. The figure shows the best achievable performance for the validation and test sets, which is not affected by model selection based on validation performance. The best case is the source-side 2-layer structure with a 250-dimensional hidden layer and Hamming distance. Because of the large cost of the grid search, we applied this setting to the other translation tasks.
From the impact investigation, we found that the best settings include batch normalization, 2 source-side layers, a 250-dimensional hidden layer, and Hamming distance. With these settings, we performed translation tasks for three language pairs, as shown in Table 8. In this table, we evaluated two different performances to show the robustness of the model selection process. One is the best performance on each data set evaluated at every epoch without a model selection process. The other is the best performance on each data set evaluated with the best model selected on the validation set. The expected performance over all metrics and data sets improves when the matching method is used. In a more detailed view, the method improves the translation quality for the English to French, Spanish, and German pairs and for the French to English pair. A notable point is that, for the best validation models, precision on the training set improves together with the validation and test results.
These results imply that the matching method is effective for improving translation quality. It is more effective when translating from a relatively simpler language to a more complex language in terms of tense and gender. A probable reason for the improvement is that the probability of generating the target sentences in those pairs is ambiguous in some cases, so the additional information given by the matching method clarifies which tokens to select. The simultaneous improvement on the training and test sets of the best validation models distinguishes this method from usual regularization methods, which increase test accuracy by reducing training accuracy.

5.2. Robustness Analysis

Table 9 shows the change in performance caused by applying the matching method. The bold scores indicate the cases that showed improvement. Across the embedding and model settings, at least half of the test cases increased their scores. Precision and BLEU improved by up to 0.99 and 0.94, respectively. The transformer showed better performance than the LSTM, but the LSTM showed more cases of improved performance. In the comparison between embedding methods, BE showed critically lower performance than the RE and WV embeddings.
From these observations, we can confirm that the matching method has a good chance of causing positive effects across various models and embedding methods. The better performance of the transformer compared to the LSTM is consistently observed in the current NMT literature. The benefit of the matching method is likely to be less positive for the transformer. This difference may be caused by different densities of information in the sentence representation generated by the encoder-transformer. The performance of the BE method is seriously worse than the others, but this is because of the large difference in the unknown-word rate. The rate of commonly used tokens is at most 9.36% for BE, while the others show values as high as 48.32%, as shown in Table 10. The scale difference of input elements in the embedded vectors is not a cause, as shown in Table 6.
Figure 9 shows three examples from the robustness analysis tests. Most translations are semantically almost correct in comparison with their reference sentences. However, the styles of the models before and after applying matching are distinguished as red and blue texts. Even when the semantic meaning is almost equivalent, the model with the matching method follows the translation of the reference sentence more closely. This is because, in addition to the original translation model, the matching method plays the role of restricting the translation from drifting into a completely different syntactic form.

5.3. Performance on State-of-the-Art Model

Table 11 shows the BLEU scores of the BiBERT model with and without our matching method. We downloaded the IWSLT'14 De↔En data using the script from fairseq (https://github.com/pytorch/fairseq/blob/main/examples/translation/prepare-iwslt14.sh, accessed on 18 January 2022) and the code from the BiBERT github repository (https://github.com/fe1ixxu/BiBERT, accessed on 18 January 2022). Only the matching method was added, and all other training procedures were the same as the baseline. An improvement in performance is observed in every condition, including the one-way, dual-directional training, and fine-tuning conditions, except for En→De with one-way training.

6. Conclusions

In this paper, we raised the issue of inefficiency in training the encoder of NMTs implemented as conditional language models. To relax this limit, we introduced sentence representation matching, which forces the representations of a source sentence and its corresponding target sentence to be located close together by adding their distance to the loss function. The impact analysis showed that source-side layers are more effective in training with less conflict, and that the Hamming distance exerts stronger pressure than the cosine distance. In the grid search, a 2-layer source-side structure with 250 hidden dimensions and Hamming distance showed the best performance in French to English translation. When translating language pairs between English and French, Spanish, and German, this setting slightly improved translation quality. In a more generalized environment using the transformer and various embedding methods that import constant vectors from external resources, the matching method produced small improvements but was more likely to increase translation performance. Sentence representation matching has specific patterns with respect to structure, distance, and capacity, but the best setting lay in somewhat intermediate states. For this reason, applying this method to more general applications requires a grid search, and the best setting found in this paper can provide a good initial point for that search.

7. Future Work

The simplest approach to control the guiding pressure is to use balancing parameters to learn the scale of the distance. This approach can effectively change the guiding pressure, but it is still subject to the limitations determined by the architectural factors discussed in this paper. Combining the two will provide finer control of the hyperparameters.

Author Contributions

Conceptualization, H.J. and K.K.; methodology, H.J. and K.K.; software, H.J. and K.K.; validation, H.J. and K.K.; formal analysis, H.J. and K.K.; investigation, H.J. and K.K.; resources, H.J. and K.K.; data curation, H.J. and K.K.; writing—original draft preparation, H.J., S.W. and K.K.; writing—review and editing, J.-H.S., S.-H.N., S.J. and S.W.; visualization, S.W.; supervision, K.K.; project administration, K.K.; funding acquisition, K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (R7119-16-1001, Core technology development of the real-time simultaneous speech translation based on knowledge enhancement) and supported by a Global University Project (GUP) grant funded by the GIST in 2019.

Data Availability Statement

Not applicable.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Ranzato, M.; Chopra, S.; Auli, M.; Zaremba, W. Sequence level training with recurrent neural networks. arXiv 2015, arXiv:1511.06732. [Google Scholar]
  2. Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; Liu, Y. Minimum risk training for neural machine translation. arXiv 2015, arXiv:1512.02433. [Google Scholar]
  3. Jung, S.; Lee, J.; Kim, J. Learning to Embed Semantic Correspondence for Natural Language Understanding. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, 31 October–1 November 2018; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 131–140. [Google Scholar] [CrossRef]
  4. Kim, K.; Shin, J.H.; Na, S.H.; Jung, S. Concept Equalization to Guide Correct Training of Neural Machine Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, 27 November–1 December 2017; Asian Federation of Natural Language Processing: Taipei, Taiwan, 2017; pp. 302–307. [Google Scholar]
  5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  6. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Montreal, QC, Canada, 2014; pp. 3104–3112. [Google Scholar]
  7. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef] [Green Version]
  8. Sundermeyer, M.; Alkhouli, T.; Wuebker, J.; Ney, H. Translation modeling with bidirectional recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 14–25. [Google Scholar]
  9. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  10. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025. [Google Scholar]
  11. Luong, M.T.; Sutskever, I.; Le, Q.V.; Vinyals, O.; Zaremba, W. Addressing the rare word problem in neural machine translation. arXiv 2014, arXiv:1410.8206. [Google Scholar]
  12. He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.Y.; Ma, W.Y. Dual learning for machine translation. Adv. Neural Inf. Process. Syst. 2016, 29, 820–828. [Google Scholar]
  13. Ahmadnia, B.; Dorr, B.J. Augmenting neural machine translation through round-trip training approach. Open Comput. Sci. 2019, 9, 268–278. [Google Scholar] [CrossRef]
  14. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  15. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: San Francisco, CA, USA, 2017; pp. 5998–6008. [Google Scholar]
  17. Liu, X.; Duh, K.; Liu, L.; Gao, J. Very deep transformers for neural machine translation. arXiv 2020, arXiv:2008.07772. [Google Scholar]
  18. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  19. Lin, Z.; Pan, X.; Wang, M.; Qiu, X.; Feng, J.; Zhou, H.; Li, L. Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information. arXiv 2020, arXiv:2010.03142. [Google Scholar]
  20. Xu, H.; Van Durme, B.; Murray, K. BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation. arXiv 2021, arXiv:2109.04588. [Google Scholar]
  21. AP, S.C.; Lauly, S.; Larochelle, H.; Khapra, M.; Ravindran, B.; Raykar, V.C.; Saha, A. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems; MIT Press: Montreal, QC, Canada, 2014; pp. 1853–1861. [Google Scholar]
  22. Luong, T.; Pham, H.; Manning, C.D. Bilingual Word Representations with Monolingual Quality in Mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA, 5 June 2015; pp. 151–159. [Google Scholar]
  23. Upadhyay, S.; Faruqui, M.; Dyer, C.; Roth, D. Cross-lingual models of word embeddings: An empirical comparison. arXiv 2016, arXiv:1604.00425. [Google Scholar]
  24. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
  25. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Montreal, QC, Canada, 2015; pp. 2692–2700. [Google Scholar]
  26. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
  27. Kalchbrenner, N.; Danihelka, I.; Graves, A. Grid long short-term memory. arXiv 2015, arXiv:1507.01526. [Google Scholar]
  28. Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Prague, Czech Republic, 25–27 June 2007; Association for Computational Linguistics: Prague, Czech Republic, 2007; pp. 177–180. [Google Scholar]
  29. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  30. Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; Rush, A. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of the ACL 2017, System Demonstrations, Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 67–72. [Google Scholar]
  31. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  32. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
Figure 1. Sentence representation matching method plugged in to typical neural machine translation (red and dashed line: typical model).
Figure 2. Layer structure for sentence representation matching.
Figure 3. Token-level precision by dimensions of hidden layer for structure types.
Figure 4. Layer-wise statistics μ_L and σ_L of weight parameters for a matching layer structure in training (cosine distance was used and the results were averaged over all hidden dimension settings).
Figure 5. Token-level precision by dimensions of hidden layers for distance metrics.
Figure 6. Layer-wise statistics μ_L and σ_L of weight parameters for distance types in training (cosine distance was used, and the results were averaged over all hidden dimension settings).
Figure 7. Layer-wise STD σ_L of 2 source-side layers by hidden dimension settings.
Figure 8. Grid search results of best achievable performance for French to English translation (red: maximum, blue: minimum, white: mean, s1: source-side 1 layer, s2: source-side 2 layers, t1: target-side 1 layer, t2: target-side 2 layers, b1: both-sides 1 layer).
Figure 9. Examples of translation from French to English in robustness analysis.
Table 1. Definition of $v_S$ and $v_T$ for layer structures.

| Structure | $v_S$ | $v_T$ |
| --- | --- | --- |
| Source-side 1 layer | $W_r r_S + b_r$ (Equation (3)) | $r_T$ (Equation (4)) |
| Source-side 2 layers | $W_{r2}\,\mathrm{sigm}(W_{r1} r_S + b_{r1}) + b_{r2}$ | $r_T$ |
| Target-side 1 layer | $r_S$ | $W_r r_T + b_r$ |
| Target-side 2 layers | $r_S$ | $W_{r2}\,\mathrm{sigm}(W_{r1} r_T + b_{r1}) + b_{r2}$ |
| Both-sides 1 layer | $W_r r_S + b_r$ | $W_r r_T + b_r$ |
Table 2. Hyperparameters for grid search of the best sentence matching network in impact analysis.

| Hyperparameter | Values |
| --- | --- |
| Structure | Source-side 2 layers, Source-side 1 layer, Target-side 2 layers, Target-side 1 layer, Both-sides 1 layer |
| Metric | Hamming, Euclidean, cosine |
| Model capacity | Hidden nodes: {10, 50, 100, 250, 500} |
Table 3. Data statistics for impact analysis (train: training set; test1: newstest-2013; test2: newscommentary-v8; validation set is a randomly selected 10% of the training set; token rows list En / non-En counts).

| Type | Set | En–Fr | En–Es | En–De | Unit |
| --- | --- | --- | --- | --- | --- |
| sentence | train | 4.1 | 3.0 | 3.6 | 10^6 |
| sentence | test1 | 2623 | 2653 | 2741 | 10^0 |
| sentence | test2 | 117,861 | 138,408 | 151,139 | 10^0 |
| token | train | 83.8 / 92.2 | 62.9 / 65.6 | 76.4 / 72.6 | 10^6 |
| token | test1 | 48.0 / 53.8 | 48.0 / 51.7 | 52.6 / 51.4 | 10^3 |
| token | test2 | 2.5 / 2.9 | 3.0 / 3.3 | 3.3 / 3.4 | 10^6 |
Table 4. Model settings for impact analysis (M: million, dim.: dimension).

| Setting | Value | Parameters | Count |
| --- | --- | --- | --- |
| LSTM stacks | 4 | encoder | 3.05 M |
| Cells per stack | 1000 | decoder | 3.10 M |
| Dim. of word | 50 | output | 11 M |
| Dim. of attention | 250 | interface | 0.19 M |
| Batch size | 128 | | |
Table 5. Data statistics for robustness analysis in French to English translation (train: Europarl-v7, validation: commontest, test1: newstest-2014, test2: newscommentary-2007).

| Corpus | Sentences (Fr-En) | Tokens (Fr) | Tokens (En) |
| --- | --- | --- | --- |
| Train | 1,737,355 | 44,201,334 | 40,094,199 |
| Valid | 22,960 | 746,023 | 650,469 |
| Test1 | 3,003 | 81,191 | 71,114 |
| Test2 | 2,007 | 58,682 | 49,690 |
Table 6. Statistics of embedding vectors for robustness analysis.

Element activation:

| Embedding | Unit | Language | Min | Max | Mean | Std |
| --- | --- | --- | --- | --- | --- | --- |
| WV | token | source | −0.8674 | 0.8900 | −0.0003 | 0.0641 |
| WV | token | target | −0.8819 | 0.8824 | 0.0018 | 0.0609 |
| RE | token | source | −0.1000 | 0.1000 | 0.0000 | 0.0577 |
| RE | token | target | −0.1000 | 0.1000 | 0.0000 | 0.0577 |
| BE | word | source + target | −0.8911 | 0.4352 | −0.0073 | 0.0463 |
| BE | position | source + target | −0.8643 | 0.4100 | 0.0000 | 0.0150 |
| BE | token | source + target | −0.3017 | 0.2326 | 0.0004 | 0.0178 |

Euclidean distance:

| Embedding | Unit | Language | Min | Max | Mean | Std |
| --- | --- | --- | --- | --- | --- | --- |
| WV | token | source | 0.0248 | 5.9327 | 1.0426 | 1.0090 |
| WV | token | target | 0.0243 | 6.3314 | 0.8861 | 1.0553 |
| RE | token | source | 1.1810 | 1.4202 | 1.3061 | 0.0259 |
| RE | token | target | 1.1971 | 1.4040 | 1.3061 | 0.0257 |
| BE | word | source + target | 0.6753 | 1.8641 | 1.2880 | 0.1630 |
| BE | position | source + target | 0.3509 | 1.3160 | 0.4126 | 0.0465 |
| BE | token | source + target | 0.4846 | 0.5031 | 0.4939 | 0.0092 |
Table 7. Model settings for robustness analysis (M: million, voca.: vocabulary).

| Parameter | LSTM (RE & WV) | LSTM (BE) | Transformer (RE & WV) | Transformer (BE) |
| --- | --- | --- | --- | --- |
| layers | 4 | 4 | 6 | 2 |
| number of heads | - | - | 8 | 8 |
| dim. of layer output | 1000 | 1000 | - | - |
| dim. of embedding | 512 | 768 | 512 | 768 |
| dim. of attention | 1000 | 1000 | 1000 | 1000 |
| dim. of model [16] | - | - | 512 | 768 |
| dim. of feed forward [16] | - | - | 2048 | 2048 |
| total parameters | 168 M | 373 M | 120 M | 303 M |
| batch size | 64 | 32 | 64 | 32 |
| voca. of encoder | 50,002 | 119,547 | 50,002 | 119,547 |
| voca. of decoder | 50,004 | 119,547 | 50,004 | 119,547 |
Table 8. Performance changes after sentence representation matching for various language pairs (δ: performance of models using the matching cost minus that of the original cost).

Best performance in each set:

| Data | Model | Valid Prec. | Valid BLEU | Test1 Prec. | Test1 BLEU | Test2 Prec. | Test2 BLEU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | E[δ] | 0.32 | 0.33 | 0.23 | 0.32 | 0.05 | 0.30 |
| En → Fr | NMT implementation of [9] | 19.25 | 27.65 | 18.45 | 23.74 | 15.57 | 24.08 |
| | NMT + matching | 20.09 | 28.31 | 19.62 | 24.83 | 16.20 | 24.74 |
| Fr → En | NMT implementation of [9] | 28.36 | 27.97 | 22.96 | 22.40 | 21.14 | 23.03 |
| | NMT + matching | 28.25 | 28.47 | 22.88 | 22.80 | 21.32 | 23.53 |
| En → Es | NMT implementation of [9] | 26.55 | 31.14 | 20.48 | 22.43 | 24.21 | 30.55 |
| | NMT + matching | 27.06 | 31.24 | 20.30 | 22.30 | 23.13 | 30.28 |
| Es → En | NMT implementation of [9] | 30.41 | 32.37 | 23.24 | 22.39 | 25.09 | 30.31 |
| | NMT + matching | 29.98 | 32.32 | 22.84 | 22.49 | 25.82 | 30.60 |
| En → De | NMT implementation of [9] | 22.78 | 17.73 | 18.55 | 13.86 | 16.69 | 14.82 |
| | NMT + matching | 24.32 | 19.22 | 20.53 | 15.17 | 17.92 | 16.31 |
| De → En | NMT implementation of [9] | 23.33 | 23.59 | 20.05 | 18.31 | 18.44 | 20.31 |
| | NMT + matching | 22.88 | 22.86 | 18.93 | 17.44 | 17.03 | 19.43 |

Best performance of best validation model:

| Data | Model | Train Prec. | Test1 Prec. | Test1 BLEU | Test2 Prec. | Test2 BLEU |
| --- | --- | --- | --- | --- | --- | --- |
| | E[δ] | 0.99 | 0.25 | 0.72 | 0.27 | 0.69 |
| En → Fr | NMT implementation of [9] | 65.89 | 17.99 | 22.88 | 15.26 | 23.64 |
| | NMT + matching | 64.09 | 19.62 | 24.41 | 16.20 | 24.74 |
| Fr → En | NMT implementation of [9] | 63.22 | 22.56 | 21.68 | 17.98 | 21.94 |
| | NMT + matching | 65.04 | 22.48 | 22.51 | 17.94 | 22.59 |
| En → Es | NMT implementation of [9] | 63.21 | 19.66 | 20.96 | 21.27 | 28.41 |
| | NMT + matching | 65.63 | 20.16 | 22.09 | 22.50 | 29.96 |
| Es → En | NMT implementation of [9] | 65.42 | 22.23 | 21.47 | 22.42 | 28.62 |
| | NMT + matching | 63.83 | 20.18 | 20.24 | 20.39 | 27.24 |
| En → De | NMT implementation of [9] | 54.85 | 17.46 | 12.07 | 13.61 | 12.58 |
| | NMT + matching | 61.24 | 20.53 | 14.87 | 17.00 | 15.81 |
| De → En | NMT implementation of [9] | 60.75 | 19.88 | 17.59 | 17.18 | 19.68 |
| | NMT + matching | 59.46 | 18.31 | 16.84 | 15.28 | 18.68 |
Table 9. Performance in robustness analysis (French to English translation, newstest: newstest-2014, nc-test: newscommentary-2007). Best performance in each set.

| Embedding | Network | Common-Test prec. | Common-Test BLEU | Newstest prec. | Newstest BLEU | Nc-Test prec. | Nc-Test BLEU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Random Embedding (RE) | LSTM | 5.72 | 10.96 | 14.08 | 21.30 | 16.73 | 27.29 |
| | LSTM + matching | 5.82 | 11.10 | 14.25 | 21.30 | 16.19 | 26.74 |
| | Transformer | 6.09 | 11.67 | 16.24 | 25.81 | 17.44 | 29.19 |
| | Transformer + matching | 6.09 | 11.61 | 16.28 | 24.96 | 18.13 | 29.32 |
| Word2Vec (WV) | LSTM | 5.77 | 11.12 | 14.55 | 21.75 | 16.67 | 26.57 |
| | LSTM + matching | 5.95 | 11.45 | 14.73 | 21.48 | 15.82 | 26.78 |
| | Transformer | 6.12 | 11.44 | 16.30 | 23.50 | 17.76 | 28.25 |
| | Transformer + matching | 6.10 | 11.33 | 16.07 | 24.60 | 18.70 | 28.46 |
| BERT Embedding (BE) | LSTM | 4.55 | 7.54 | 10.40 | 12.62 | 10.90 | 14.21 |
| | LSTM + matching | 4.66 | 7.62 | 11.01 | 12.87 | 11.06 | 14.41 |
| | Transformer | 2.86 | 2.49 | 5.26 | 2.64 | 6.19 | 3.54 |
| | Transformer + matching | 3.30 | 1.86 | 6.25 | 2.36 | 6.97 | 2.74 |
Table 10. Overlapping rate (%) of words and tokens between the imported embedding dictionary (RE & WV: 50,000 tokens; BE: 119,547 tokens) and the data sets for robustness analysis (values are Fr / En).

| Dataset | Sentences | RE & WV Word (Fr / En) | RE & WV Token (Fr / En) | BE Word (Fr / En) | BE Token (Fr / En) |
| --- | --- | --- | --- | --- | --- |
| Europarl | 1,737,355 | 38.95 / 48.32 | 99.71 / 99.84 | 7.01 / 9.35 | 78.03 / 89.40 |
| common-test | 22,960 | 87.53 / 89.49 | 99.51 / 99.63 | 17.88 / 31.85 | 79.40 / 89.86 |
| newstest | 3003 | 81.80 / 82.39 | 95.92 / 96.08 | 30.44 / 45.33 | 74.87 / 83.07 |
| nc-test | 2007 | 92.17 / 92.87 | 98.67 / 98.65 | 31.89 / 48.34 | 75.26 / 84.05 |
Table 11. Comparison of dual-directional and ordinary (one-way) translation models in [20], including stochastic layer selection (K = 8), with and without sentence representation mapping on IWSLT'14 De↔En.

| Method | De → En Baseline [20] | De → En + Matching | En → De Baseline [20] | En → De + Matching |
| --- | --- | --- | --- | --- |
| One-Way (vocab size = 12 K) | 37.69 | 38.13 | 30.00 | 29.93 |
| Dual-Directional Training | 38.37 | 38.42 | 30.30 | 30.50 |
| + Fine-Tuning | 38.61 | 38.70 | 30.45 | 30.53 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
