Jointly Learning to Align and Translate with Transformer Models
Built on top of Transformer-XL, the Compressive Transformer [1] condenses old memories (hidden states) and stores them in a compressed memory buffer rather than discarding them outright. This model is well suited to long-range sequence learning, but may impose unnecessary computational overhead on tasks that only involve short sequences.
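The memory update described above can be sketched as follows. This is an illustrative simplification (all names are my own, and mean-pooling stands in for the paper's learned compression functions): states evicted from the regular memory are compressed into a second buffer instead of being thrown away.

```python
import numpy as np

def update_memories(mem, comp_mem, new_states, mem_len=4, comp_rate=2):
    """Sketch of a Compressive Transformer-style memory update.

    mem:       regular (uncompressed) memory, shape [n, d]
    comp_mem:  compressed memory buffer, shape [m, d]
    new_states: newly computed hidden states to append, shape [k, d]
    """
    mem = np.concatenate([mem, new_states], axis=0)
    if len(mem) > mem_len:
        # evict the oldest states from regular memory
        old, mem = mem[:-mem_len], mem[-mem_len:]
        # compress evicted states by mean-pooling every `comp_rate`
        # consecutive states (the paper uses learned compression instead)
        n = len(old) // comp_rate * comp_rate
        compressed = old[:n].reshape(-1, comp_rate, old.shape[-1]).mean(axis=1)
        comp_mem = np.concatenate([comp_mem, compressed], axis=0)
    return mem, comp_mem
```

With `comp_rate=2`, every two evicted hidden states become one compressed slot, so the compressed buffer grows half as fast as the regular memory would.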
What characterizes an easier-to-train, easier-to-generalize neural model?
My notes for the paper: Adaptive Computation Time for Recurrent Neural Networks [1].
Additive vs. multiplicative halting probability. In the paper (footnote 1), the authors thoroughly discuss their considerations for deciding the computation time.
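The additive variant used in ACT can be sketched as follows (a simplified illustration, not the paper's full training procedure): per-step halting probabilities are summed, computation stops at the first step where the running sum would exceed 1 - epsilon, and that final step receives the remainder so the step weights form a valid distribution.

```python
def act_step_weights(halt_probs, eps=0.01):
    """Additive halting: accumulate halting probabilities until the
    running total reaches 1 - eps, then assign the remainder to the
    last step so the weights sum to 1."""
    total, weights = 0.0, []
    for p in halt_probs:
        if total + p >= 1 - eps:
            weights.append(1.0 - total)  # remainder for the final step
            break
        weights.append(p)
        total += p
    return weights
```

For example, with per-step halting probabilities `[0.3, 0.4, 0.5]`, the third step trips the threshold and gets the remainder `0.3`, yielding weights that sum to 1.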