Transformer Align Model - Shaojie Jiang

Notes on the paper: Jointly Learning to Align and Translate with Transformer Models.

Application Scenarios of Word Alignments in NMT

Word alignments in neural machine translation serve several purposes:

Creating bilingual lexica from parallel corpora
Supporting external dictionary-assisted translation for low-frequency words
Enabling trust, explanation, and error analysis
Preserving webpage styling during translation

Model Design

The paper leverages the observation that “attention probabilities from the penultimate layer of a normally trained transformer MT model corresponds to word alignments.” Rather than supervising all attention heads, the authors designate a single attention head in the penultimate layer as the alignment head, allowing the model flexibility in choosing whether to prioritize alignment or other attention mechanisms.

Training the Alignment Head

Two existing approaches use either labeled alignments with KL-divergence training or attentional vectors to predict target words or linguistic properties.

This work employs an unsupervised approach: an alignment model is first trained on translation, then penultimate layer attention weights are averaged and used as weak supervision for a combined translation-alignment model trained bidirectionally.

Results: While alignment performance improved, translation results remained moderate compared to previous work reporting gains from alignment supervision.