paper reading notes

Transformer Align Model

Jointly Learning to Align and Translate with Transformer Models

Compressive Transformers

Built on top of Transformer-XL, the Compressive Transformer condenses old memories (hidden states) and stores them in a compressed memory buffer before eventually discarding them completely. The model is well suited to long-range sequence learning, but the extra memory machinery may add unnecessary computational overhead for tasks that only involve short sequences.
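
A minimal sketch of the memory update described above, assuming mean-pooling as the compression function (the paper also considers convolutional and attention-based compressors); the buffer sizes and function names here are illustrative, not the paper's code.

```python
import numpy as np

def update_memories(memory, comp_memory, new_states,
                    mem_len=512, comp_len=512, compression_rate=3):
    """One memory update in the spirit of the Compressive Transformer.

    memory:      (m, d) most recent hidden states, kept uncompressed
    comp_memory: (c, d) older hidden states, compressed
    new_states:  (s, d) hidden states produced for the current segment
    """
    # Append the new segment to the uncompressed memory (oldest states first).
    memory = np.concatenate([memory, new_states], axis=0)

    # States that overflow the uncompressed memory are evicted ...
    overflow = memory.shape[0] - mem_len
    if overflow > 0:
        evicted, memory = memory[:overflow], memory[overflow:]

        # ... and compressed (here: mean-pooled in groups of
        # `compression_rate`) instead of being thrown away immediately.
        n = (evicted.shape[0] // compression_rate) * compression_rate
        if n > 0:
            pooled = evicted[:n].reshape(-1, compression_rate,
                                         evicted.shape[1]).mean(axis=1)
            comp_memory = np.concatenate([comp_memory, pooled], axis=0)

        # Only the oldest *compressed* memories are finally discarded.
        comp_memory = comp_memory[-comp_len:]

    return memory, comp_memory
```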

Visualizing the Loss Landscape of Neural Nets

What characterizes a neural model that is easier to train and generalizes better?
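
A minimal sketch of the kind of 1D loss-surface slice the paper visualizes, assuming a generic `loss_fn` over a flattened parameter vector; the paper normalizes the random direction filter by filter, whereas this sketch uses a single global rescaling to stay model-agnostic.

```python
import numpy as np

def loss_slice(params, loss_fn, n_points=41, span=1.0, seed=0):
    """Evaluate the loss along a random direction around trained weights.

    params:  (n,) flattened trained parameters theta*
    loss_fn: function (n,) -> scalar training loss
    Returns the offsets alpha and the losses L(theta* + alpha * d).
    """
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(params.shape)

    # Rescale the direction to the norm of the weights; the paper does this
    # per filter ("filter normalization") so slices are comparable across nets.
    d *= np.linalg.norm(params) / np.linalg.norm(d)

    alphas = np.linspace(-span, span, n_points)
    losses = np.array([loss_fn(params + a * d) for a in alphas])
    return alphas, losses
```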

Adaptive Computation Time

My notes for the paper Adaptive Computation Time for Recurrent Neural Networks. Additive vs. multiplicative halting probability: in the paper (footnote 1), the authors thoroughly discuss their considerations for deciding the computation time.
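
A minimal sketch of the additive halting scheme that ACT adopts, assuming a generic recurrent update `step` and halting-probability head `halt_prob` (both placeholders): pondering stops once the accumulated halting probabilities reach 1 - epsilon, and the output is the halting-weighted mean of the intermediate states.

```python
import numpy as np

def act_step(state, x, step, halt_prob, eps=0.01, max_ponder=10):
    """One ACT time step with additive halting.

    state:     (d,) current hidden state
    x:         input for this time step
    step:      function (state, x, first) -> new_state
    halt_prob: function new_state -> scalar in (0, 1)
    """
    cum_p = 0.0
    weighted_state = np.zeros_like(state)
    ponder_steps = 0

    while True:
        state = step(state, x, first=(ponder_steps == 0))
        p = halt_prob(state)
        ponder_steps += 1

        # Halt once the *sum* of halting probabilities crosses 1 - eps
        # (or a hard cap is hit); this is the additive scheme, as opposed
        # to multiplying per-step continuation probabilities.
        if cum_p + p >= 1.0 - eps or ponder_steps >= max_ponder:
            remainder = 1.0 - cum_p  # closes the distribution to sum to 1
            weighted_state += remainder * state
            break

        cum_p += p
        weighted_state += p * state

    return weighted_state, ponder_steps
```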