Machine translation (MT), the process of translating text from a source language into a target language, is one of the most important applications of NLP.
MT is a sequence-to-sequence task, one of many in NLP: we take a sequence of words in one language as input and want to produce a sequence of words in another language as output.
Other tasks can be framed the same way. Summarization, for example, is also a sequence-to-sequence task; it can be thought of as a kind of machine translation within a single language.
The traditional approach is statistical machine translation. In this post I will concentrate on neural machine translation, specifically on convolutional neural networks.
In 2006 Google launched Google Translate as a statistical machine translation service; it used United Nations and European Parliament transcripts as training data.
The first scientific paper on using neural networks in machine translation appeared in 2014 and was followed by many advances over the next few years. It is remarkable how quickly neural machine translation systems went from research papers to production.
Convolutional neural networks are less common for MT, despite their advantages. Compared to recurrent layers, convolutions create representations for fixed-size contexts; however, the effective context of the network can easily be made larger by stacking several layers on top of each other, which makes it possible to control precisely the maximum length of the dependencies being modeled.
Convolutional networks also do not depend on the computations of previous time steps and therefore allow parallelization over every element in a sequence. This contrasts with RNNs, which maintain a hidden state of the entire past that prevents parallel computation within a sequence. Multi-layer convolutional networks create hierarchical representations over the input sequence, in which nearby input elements interact at lower layers while distant elements interact at higher layers. This hierarchical structure captures long-range dependencies more directly than the chain structure modeled by recurrent networks.
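To make the two points above concrete, here is a minimal sketch in PyTorch (illustrative names and sizes, not any particular paper's configuration) of a stack of 1-D convolutions over a sequence of embeddings: each layer only sees a fixed window, but depth multiplies the receptive field, and every position is processed in parallel.

```python
import torch
import torch.nn as nn

class StackedConvEncoder(nn.Module):
    def __init__(self, emb_dim=256, hidden=256, kernel_size=3, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(emb_dim if i == 0 else hidden, hidden,
                      kernel_size, padding=kernel_size // 2)
            for i in range(num_layers)
        ])

    def forward(self, x):            # x: (batch, seq_len, emb_dim)
        h = x.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        for conv in self.layers:
            h = torch.relu(conv(h))  # all positions computed in parallel
        return h.transpose(1, 2)     # (batch, seq_len, hidden)

# With kernel_size=3 and 4 layers, each output position "sees"
# 1 + 4 * (3 - 1) = 9 input tokens: the receptive field grows with depth.
emb = torch.randn(2, 10, 256)        # a batch of 2 sentences, 10 tokens each
out = StackedConvEncoder()(emb)      # shape: (2, 10, 256)
```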
Hu et al. highlighted how CNNs can be used to encode not only the semantic similarity of the translation pair, but also the context containing the phrase in the source language and thus yield a more effective translation.
As shown in the figure below, the convolutional sentence model (the bottom layer) summarizes the meaning of the source sentence and the target phrase, and the matching model (the top layers) compares the two representations using a multi-layer perceptron.
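A minimal sketch of this two-part design (PyTorch; the layer sizes and pooling choice are my assumptions, not Hu et al.'s exact configuration): a convolutional sentence model pools each sequence into a fixed-size vector, and an MLP scores how well the two representations match.

```python
import torch
import torch.nn as nn

class ConvSentenceModel(nn.Module):
    def __init__(self, emb_dim=128, hidden=128, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (batch, seq_len, emb_dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))
        return h.max(dim=2).values              # max-pool over time -> (batch, hidden)

class MatchingModel(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.encoder = ConvSentenceModel(hidden=hidden)
        self.mlp = nn.Sequential(                # compares the two summaries
            nn.Linear(2 * hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))                # matching score

    def forward(self, source, target_phrase):
        s = self.encoder(source)                 # summary of source sentence and context
        t = self.encoder(target_phrase)          # summary of candidate target phrase
        return self.mlp(torch.cat([s, t], dim=-1))

score = MatchingModel()(torch.randn(2, 12, 128), torch.randn(2, 5, 128))
```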
Another interesting aspect of this research is its use of curriculum training: the training data is categorized from easy to difficult, the contexts used for encoding move from phrases to full sentences, and the difficulty of the sentences is gradually increased during training.
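A minimal sketch of that ordering idea, using source-sentence length as a stand-in for the paper's difficulty measure (an assumption on my part): train on the easy pairs first, then gradually mix in harder ones.

```python
def curriculum_batches(pairs, num_stages=4):
    """pairs: list of (source_tokens, target_tokens) tuples."""
    ordered = sorted(pairs, key=lambda p: len(p[0]))   # easy = short source sentence
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(1, num_stages + 1):
        # each stage re-trains on everything seen so far plus harder examples
        yield ordered[: stage * stage_size]
```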
Meng et al. built a CNN-based framework that draws on guiding signals from both the source and the target during translation. Instead of the straightforward convolution-pooling strategy, in which the "fusion" decisions are based on the values of the feature maps (which works for tasks like classification), the authors used CNNs with gating, which provides guidance on which parts of the source text influence the target words. Fusing these signals with the entire source sentence as context yields a better joint model.
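Here is a minimal sketch (PyTorch) of gated convolution in the spirit described above; the details are illustrative, not Meng et al.'s exact design. The convolution produces twice the channels, and a sigmoid gate decides which parts of the source representation are allowed to pass through.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                      # x: (batch, channels, seq_len)
        a, b = self.conv(x).chunk(2, dim=1)    # content half and gate half
        return a * torch.sigmoid(b)            # gate controls what influences the target

out = GatedConv()(torch.randn(2, 256, 10))     # shape preserved: (2, 256, 10)
```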
Gehring et al. use a CNN with an attention module, demonstrating not only a fast, parallelizable implementation but also a more effective model than LSTM-based ones. Word and positional embeddings in the input representation, filters stacked on top of each other for hierarchical mappings, gated linear units, and a multi-step attention module give the researchers a real edge on English-French and English-German translation.
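To tie those ingredients together, here is a minimal sketch (PyTorch) of word plus positional embeddings, a gated linear unit, and a dot-product attention step over the encoder states. It illustrates the ideas only; it is not the full ConvS2S architecture (for instance, it omits the causal padding and multi-step attention of the real model).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConvS2SBlock(nn.Module):
    def __init__(self, vocab=10000, max_len=100, dim=256, kernel_size=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)        # learned positional embeddings
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2)

    def embed(self, tokens):                             # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.word_emb(tokens) + self.pos_emb(positions)

    def forward(self, src_tokens, tgt_tokens):
        src = self.embed(src_tokens)                      # (batch, src_len, dim)
        tgt = self.embed(tgt_tokens)                      # (batch, tgt_len, dim)
        h = F.glu(self.conv(tgt.transpose(1, 2)), dim=1)  # gated linear unit
        h = h.transpose(1, 2)                             # (batch, tgt_len, dim)
        attn = F.softmax(h @ src.transpose(1, 2), dim=-1) # decoder-encoder attention
        return h + attn @ src                             # attend, then combine

out = TinyConvS2SBlock()(torch.randint(0, 10000, (2, 7)),
                         torch.randint(0, 10000, (2, 5)))
```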