Transformer Align Model

Jointly Learning to Align and Translate with Transformer Models

Compressive Transformers

Built on top of Transformer-XL, the Compressive Transformer [1] condenses old memories (hidden states) and stores them in a compressed memory buffer before discarding them completely. The model is well suited to long-range sequence learning, but it may add unnecessary computational overhead for tasks that only involve short sequences.
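The memory handling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the class name `CompressiveMemory`, the helper `mean_pool`, and the parameters `mem_size`, `cmem_size`, and `rate` are hypothetical names chosen for this sketch, and mean pooling stands in for whichever compression function a real model would use.

```python
from collections import deque


def mean_pool(states, rate):
    """Compress each group of `rate` consecutive states into one by averaging.

    Assumption: mean pooling is used here purely for illustration.
    """
    pooled = []
    for i in range(0, len(states), rate):
        group = states[i:i + rate]
        dim = len(group[0])
        pooled.append([sum(s[d] for s in group) / len(group) for d in range(dim)])
    return pooled


class CompressiveMemory:
    """Hypothetical two-level memory: a regular FIFO plus a compressed FIFO."""

    def __init__(self, mem_size, cmem_size, rate):
        self.mem_size = mem_size
        self.rate = rate
        self.memory = []  # most recent hidden states, oldest first
        # The oldest compressed entries are dropped once the buffer is full.
        self.compressed = deque(maxlen=cmem_size)

    def update(self, new_states):
        """Append new hidden states; compress whatever falls out of memory."""
        self.memory.extend(new_states)
        overflow = len(self.memory) - self.mem_size
        if overflow > 0:
            evicted = self.memory[:overflow]
            self.memory = self.memory[overflow:]
            # Evicted states are condensed rather than discarded outright.
            self.compressed.extend(mean_pool(evicted, self.rate))
```

With `rate=2`, every two evicted hidden states become one compressed entry, so the compressed buffer covers twice as many past steps per slot as the regular memory.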

Visualizing the Loss Landscape of Neural Nets

What characterizes a neural model that is easier to train and generalizes better?

Adaptive Computation Time

My notes for the paper: Adaptive Computation Time for Recurrent Neural Networks [1]. Additive vs multiplicative halting probability. Multiplicative: In the paper (footnote 1), the authors thoroughly discuss their considerations for deciding the computation time.
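For reference, the additive halting rule can be sketched as follows. This is a minimal illustration, assuming the per-step halting units (values in (0, 1)) have already been computed; the function name `act_halting_weights` and its parameters are hypothetical names for this sketch. The multiplicative alternative is not shown because the excerpt above does not spell it out.

```python
def act_halting_weights(halting_units, eps=0.01):
    """Turn per-step halting units into halting probabilities (additive rule).

    Steps are consumed until the running sum of halting units reaches
    1 - eps. The final step receives the remainder, so the returned
    probabilities sum to 1. If the threshold is never reached, the last
    available step takes the remainder (a cap on the number of steps).
    """
    weights = []
    total = 0.0
    last = len(halting_units) - 1
    for n, h in enumerate(halting_units):
        if total + h >= 1.0 - eps or n == last:
            weights.append(1.0 - total)  # remainder assigned to the final step
            break
        weights.append(h)
        total += h
    return weights
```

The number of returned weights is the number of computation steps taken for that input, and the weights are what would be used to average the intermediate states and outputs.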

A Hub for Transformer Blogs and Papers

This is a growing list of pointers to useful blog posts and papers related to transformers.
Transformers explained
  Blog: The Illustrated Transformer has many intuitive animations of how transformer models work
  Blog: Universal Transformers introduces the idea of recurrence among layers
  Blog: Transformer vs RNN and CNN for Translation Task
GNNs: similarities and differences
  Blog: Transformers are Graph Neural Networks bridges transformer models and Graph Neural Networks
Transformer improvements
  Blog: DeepMind Releases a New Architecture and a New Dataset to Improve Long-Term Memory in Deep Learning Systems
  Neural Turing Machine + transformer?

What's New in XLNet?

In this post, I will try to understand what makes XLNet better than BERT.