Transformer Align Model

In this paper¹, a Transformer is trained to perform both translation and alignment tasks.

Application scenarios of word alignments in NMT

  • Generating bilingual lexica from parallel corpora
  • External dictionary assisted translation to improve translation of low frequency words
  • Trust, explanation, error analysis
  • Preserving style on webpages

Model design

The attention mechanism has long been motivated by word alignments in statistical machine translation, but to ensure alignment quality, additional supervision is needed.

The attention probabilities from the penultimate layer of a conventionally trained Transformer MT model tend to correspond to word alignments. Therefore, one attention head (clever!) in the penultimate layer is trained as the alignment head. The motivation for supervising only a single head is to leave the model free to decide how much to rely on the alignment head versus the other attention heads.
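To make this concrete, here is a minimal sketch (function and variable names are mine, not from the paper) of how hard word alignments can be read off a single attention head: each target token is aligned to the source position it attends to most.

```python
import torch

def extract_alignments(attn):
    """Hypothetical helper: attn is a [tgt_len, src_len] matrix of
    attention probabilities from the designated alignment head.
    Align each target token to its highest-probability source token."""
    return attn.argmax(dim=-1)

# Toy example: 3 target tokens attending over 4 source tokens.
attn = torch.tensor([[0.70, 0.10, 0.10, 0.10],
                     [0.10, 0.60, 0.20, 0.10],
                     [0.05, 0.05, 0.10, 0.80]])
print(extract_alignments(attn).tolist())  # → [0, 1, 3]
```

This argmax decoding is one simple way to turn soft attention into discrete alignment links; thresholding the probabilities instead is another common choice.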

How to train the alignment head

Two approaches exist in the literature:

  • Label alignments beforehand and train the attention weights through KL-divergence.
  • Use the attentional vector to also predict either the target word or properties of the target tokens, such as POS tags.
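The first approach above can be sketched as a loss term: the gold alignment links for each target token are normalised into a distribution over source positions, and the alignment head's attention is pushed toward it with a KL divergence. This is my own minimal rendering of the idea, not the paper's exact implementation.

```python
import torch

def alignment_kl_loss(attn_probs, gold_links):
    """Sketch of supervising attention with labeled alignments.
    attn_probs -- [tgt_len, src_len] attention of the alignment head
    gold_links -- [tgt_len, src_len] 0/1 alignment labels
    Returns KL(gold || attn), averaged over target positions."""
    gold = gold_links / gold_links.sum(dim=-1, keepdim=True)
    eps = 1e-9  # guard against log(0)
    kl = (gold * ((gold + eps).log() - (attn_probs + eps).log())).sum(-1)
    return kl.mean()

gold = torch.tensor([[1., 0.], [0., 1.]])
perfect = alignment_kl_loss(gold.clone(), gold)   # attention matches labels
print(float(perfect))  # → ~0.0
```

When the attention already matches the gold distribution, the loss is (up to the epsilon) zero; otherwise it penalises probability mass placed on unaligned source tokens.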

In this work, an unsupervised approach is used to train the alignment head. A Transformer is first trained on translation only; its penultimate-layer attention weights, averaged over heads, are then used as weak alignment supervision for the joint translation-and-alignment model. The alignment labels are produced in both translation directions.

Previous work reported translation gains from introducing alignment supervision. In this paper, however, the alignment results are strong, while the translation results are only moderate.


  1. Jointly Learning to Align and Translate with Transformer Models

PhD student at ILPS