Jointly Learning to Align and Translate with Transformer Models
Built on top of Transformer-XL, the Compressive Transformer [1] condenses old memories (hidden states) and stores them in a compressed memory buffer instead of discarding them outright. The model is well suited to long-range sequence learning, but its extra machinery can become unnecessary computational overhead for tasks with only short sequences.
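The memory update can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: mean-pooling is just one of the compression functions the paper explores, and the buffer sizes and compression rate here are arbitrary.

```python
import numpy as np

def update_memories(memory, comp_memory, new_states, mem_len=4, comp_rate=2):
    """Append new hidden states to `memory`; states that overflow the
    buffer are compressed (here by mean-pooling groups of `comp_rate`
    consecutive states) into `comp_memory` rather than discarded."""
    memory = np.concatenate([memory, new_states], axis=0)
    if len(memory) > mem_len:
        overflow = memory[: len(memory) - mem_len]   # oldest states
        memory = memory[len(memory) - mem_len :]     # keep the newest mem_len
        # Compress each group of `comp_rate` states into one vector.
        # (A leftover group shorter than comp_rate is dropped in this sketch.)
        n = (len(overflow) // comp_rate) * comp_rate
        if n:
            pooled = overflow[:n].reshape(n // comp_rate, comp_rate, -1).mean(axis=1)
            comp_memory = np.concatenate([comp_memory, pooled], axis=0)
    return memory, comp_memory

d = 8
mem, cmem = np.zeros((0, d)), np.zeros((0, d))
for _ in range(3):  # three segments of 4 hidden states each
    mem, cmem = update_memories(mem, cmem, np.random.randn(4, d))
print(mem.shape, cmem.shape)  # memory stays capped; older states end up pooled in cmem
```

After three segments of four states, the regular memory holds the most recent four states and the compressed memory holds the older eight states pooled down to four.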
What characterizes an easier-to-train, easier-to-generalize neural model?
My notes for the paper Adaptive Computation Time for Recurrent Neural Networks [1].
Additive vs. multiplicative halting probability: in the paper (footnote 1), the authors thoroughly discuss their considerations for deciding the computation time.
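The additive scheme can be sketched as follows. This is a minimal sketch of ACT-style halting under stated assumptions: the 0.99 threshold stands in for the paper's 1 − ε, and the per-step probabilities are made-up illustrative values.

```python
def act_ponder(halting_probs, threshold=0.99):
    """Additive halting: accumulate per-step halting probabilities until
    the running sum reaches the threshold; the final step receives the
    remainder so the step weights sum to 1."""
    cum, weights = 0.0, []
    for p in halting_probs:
        if cum + p >= threshold:
            weights.append(1.0 - cum)  # remainder assigned to the last step
            break
        weights.append(p)
        cum += p
    return weights  # ponder time = len(weights)

w = act_ponder([0.3, 0.4, 0.5, 0.2])
print(len(w), sum(w))  # halts after 3 steps; weights sum to 1
```

The weights can then be used to average the per-step hidden states into a single output, which is how ACT keeps the whole computation differentiable.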
This is a growing list of pointers to useful blog posts and papers related to transformers.
Transformers explained
- Blog: The Illustrated Transformer has many intuitive animations of how transformer models work
- Blog: Universal Transformers introduces the idea of recurrence among layers
- Blog: Transformer vs RNN and CNN for Translation Task

GNNs: similarities and differences
- Blog: Transformers are Graph Neural Networks bridges transformer models and Graph Neural Networks

Transformer improvements
- Blog: DeepMind Releases a New Architecture and a New Dataset to Improve Long-Term Memory in Deep Learning Systems (Neural Turing Machine + transformer?)
In this post, I will try to understand what makes XLNet better than BERT.