Transformer Align Model

Jointly Learning to Align and Translate with Transformer Models

Compressive Transformers

Built on top of Transformer-XL, the Compressive Transformer condenses old memories (hidden states) and stores them in a compressed memory buffer before eventually discarding them. The model is suited to long-range sequence learning, but the extra memory adds computational overhead that is hard to justify for tasks with only short sequences.
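
To make the memory bookkeeping concrete, here is a minimal sketch in plain Python. It is not the model's actual implementation: the class name `CompressiveMemory`, the parameters `mem_size`, `cmem_size`, and `rate`, and the choice of mean pooling as the compression function are all assumptions made for illustration. Hidden states are represented as plain lists of floats, and attention over the two buffers is left out entirely.

```python
from collections import deque


def mean_pool(states, rate):
    """Average each consecutive group of up to `rate` vectors into one vector."""
    pooled = []
    for i in range(0, len(states), rate):
        group = states[i:i + rate]
        # zip(*group) iterates over dimensions; divide by the real group size
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


class CompressiveMemory:
    """Hypothetical two-level memory: a regular buffer plus a compressed buffer."""

    def __init__(self, mem_size, cmem_size, rate):
        self.mem_size = mem_size            # capacity of the regular memory
        self.rate = rate                    # how many old states fold into one
        self.mem = []                       # recent hidden states, oldest first
        self.cmem = deque(maxlen=cmem_size) # compressed states; oldest fall off

    def update(self, new_states):
        """Append a segment's hidden states and compress whatever overflows."""
        self.mem.extend(new_states)
        overflow = len(self.mem) - self.mem_size
        if overflow > 0:
            evicted = self.mem[:overflow]
            self.mem = self.mem[overflow:]
            # Instead of dropping evicted states, keep a condensed version.
            self.cmem.extend(mean_pool(evicted, self.rate))


if __name__ == "__main__":
    memory = CompressiveMemory(mem_size=4, cmem_size=2, rate=2)
    for start in range(1, 10, 2):
        # Each segment contributes two 1-dimensional "hidden states".
        memory.update([[float(start)], [float(start + 1)]])
    print("memory:", memory.mem)
    print("compressed memory:", list(memory.cmem))
```

The sketch also shows where the extra cost comes from: every segment pays for maintaining and later attending over a second buffer, which brings little benefit when sequences are short enough to fit in the regular memory.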

What's New in XLNet?

In this post, I try to understand what makes XLNet better than BERT.