Compressive Transformers

Notes on the paper: Compressive Transformers for Long-Range Sequence Modelling.

The Compressive Transformer “condenses old memories (hidden states) and stores them in the compressed memory buffer, before completely discarding them.” It suits long-range sequence learning but may be computationally expensive for shorter sequences.
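The memory flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `update_memories`, `mean_pool`, `n_m` (memory size), `n_cm` (compressed memory size) and `c` (compression rate) are chosen here for illustration, and mean pooling stands in for a learned compression function.

```python
def mean_pool(block):
    """Average a block of equal-length vectors into a single vector."""
    dim = len(block[0])
    return [sum(vec[i] for vec in block) / len(block) for i in range(dim)]


def update_memories(memory, compressed, new_states, n_m, n_cm, c):
    """Append new hidden states and return the updated (memory, compressed).

    All three inputs are lists of vectors ordered oldest first.
    Assumes n_m >= 0, n_cm > 0 and c >= 1 (hypothetical parameter names).
    """
    memory = memory + new_states
    # States that no longer fit in the regular memory are evicted, oldest first.
    overflow = max(0, len(memory) - n_m)
    evicted, memory = memory[:overflow], memory[overflow:]
    # Instead of being dropped, evicted states are compressed c at a time.
    new_compressed = [mean_pool(evicted[i:i + c])
                      for i in range(0, len(evicted), c)]
    # The compressed memory is itself bounded: only its oldest entries are lost.
    compressed = (compressed + new_compressed)[-n_cm:]
    return memory, compressed


memory, compressed = [], []
for step in range(0, 10, 2):
    states = [[float(step + 1)], [float(step + 2)]]  # two toy 1-d hidden states
    memory, compressed = update_memories(memory, compressed, states,
                                         n_m=4, n_cm=2, c=2)
```

With these toy settings the regular memory keeps the 4 most recent states, while the compressed memory summarises up to n_cm * c = 4 older ones in only 2 slots.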

Background

The idea is motivated by human cognition: people memorize through lossy compression rather than storing every detail. The paper contrasts three existing ways of handling past context:

RNNs: Compress memories into fixed-size vectors (space-efficient but hard to parallelize)
Transformers: Store all past memories uncompressed (better performance, but compute and memory costs grow with the length of the context)
Sparse attention: Reduces the computation of attention but still stores every past memory, so the space cost is not reduced

Model Design and Training

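The Compressive Transformer attends over both the regular memories and the compressed memories. To avoid backpropagating through time over long unrolls, the compression function is trained with a local auxiliary loss. The clever part is the choice of target: rather than reconstructing the original memory vectors, the loss reconstructs the attention output, so attention computed over the compressed memories should match attention computed over the original memories they replace.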
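A rough sketch of such an attention-reconstruction loss follows, using only the standard library. It is a simplification under several assumptions: single-head dot-product attention with identity query/key/value projections, mean pooling as the compression function, and hypothetical function names throughout. It computes the loss value only; in actual training the gradient of this loss would update the compression function alone.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, slots):
    """Single-head scaled dot-product attention (identity projections assumed)."""
    dim = len(slots[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, s)) / scale for s in slots]
        weights = softmax(scores)
        outputs.append([sum(w * s[i] for w, s in zip(weights, slots))
                        for i in range(dim)])
    return outputs


def mean_pool_compress(old_memories, c):
    """Stand-in compression function: average every c consecutive memories."""
    dim = len(old_memories[0])
    blocks = [old_memories[i:i + c] for i in range(0, len(old_memories), c)]
    return [[sum(v[i] for v in block) / len(block) for i in range(dim)]
            for block in blocks]


def attention_reconstruction_loss(hidden, old_memories, compressed):
    """L2 distance between attention over the originals and over the compressed."""
    target = attend(hidden, old_memories)
    approx = attend(hidden, compressed)
    squared = sum((t - a) ** 2
                  for t_row, a_row in zip(target, approx)
                  for t, a in zip(t_row, a_row))
    return math.sqrt(squared)


hidden = [[1.0, 0.0]]                      # current hidden states (queries)
old_memories = [[1.0, 0.0], [0.0, 1.0]]    # memories about to be compressed
compressed = mean_pool_compress(old_memories, c=2)
loss = attention_reconstruction_loss(hidden, old_memories, compressed)
```

The loss is zero whenever attending to the compressed slots yields the same output as attending to the originals, so the compressor is only penalised for discarding information that the attention actually uses.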

Practical Concerns

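The auxiliary loss updates only the compression module; its gradients are stopped from flowing into the main network so that they cannot harm it
Accumulating gradients over several steps (an effective batch size 4x larger) improves performance
Gradient norms are clipped to 0.1 for training stability
Among the compression functions compared, convolution works best for memory compression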
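Two of the points above, gradient accumulation and norm clipping, can be illustrated with a small standard-library sketch. The helper names are hypothetical, gradients are represented as flat lists of floats, and plain SGD is used only to keep the example short; it is not meant to mirror the optimizer used in the paper.

```python
import math


def accumulate(micro_batch_grads):
    """Average per-parameter gradients over several micro-batches."""
    n = len(micro_batch_grads)
    return [sum(column) / n for column in zip(*micro_batch_grads)]


def clip_by_global_norm(grads, max_norm=0.1):
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def sgd_step(params, micro_batch_grads, lr=0.01, max_norm=0.1):
    """One update from accumulated, clipped gradients (plain SGD assumed)."""
    grads = clip_by_global_norm(accumulate(micro_batch_grads), max_norm)
    return [p - lr * g for p, g in zip(params, grads)]


params = [1.0, 1.0]
# Four micro-batches feed a single update, i.e. a 4x larger effective batch.
micro_batch_grads = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
params = sgd_step(params, micro_batch_grads, lr=1.0, max_norm=0.1)
```

Clipping rescales the whole gradient vector rather than each component separately, so the update direction is preserved while its size is capped.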

Further Questions

Two open questions: why do Compressive Transformers improve the modeling of rare words, and could the compression be made adaptive, either by varying the compression rate or by using RNNs as compressors?