Compressive Transformers
Notes on Compressive Transformers — a model that condenses old memories and stores them in compressed memory buffers for long-range sequence learning.
Notes on the paper: Compressive Transformers for Long-Range Sequence Modelling.
The Compressive Transformer “condenses old memories (hidden states) and stores them in the compressed memory buffer, before completely discarding them.” It suits long-range sequence learning but may be computationally expensive for shorter sequences.
Background
The concept connects to human cognition: humans use lossy compression when memorizing. Three approaches compared:
- RNNs: Compress memories into fixed-size vectors (space-efficient but hard to parallelize)
- Transformers: Store all past memories uncompressed (better performance but increasing computational/memory costs)
- Sparse attention: Reduces computation while maintaining spatial costs
Model Design and Training
The Compressive Transformer uses attention mechanisms over both memories and compressed memories. Training employs local auxiliary losses to avoid long backpropagation-through-time, with a clever approach: rather than reconstructing original memory vectors, the model reconstructs attention vectors instead.
Practical Concerns
- Auxiliary loss trains only the compression module to prevent harming the main network
- Batch accumulation (4x larger batches) improves performance
- Gradient norms clipped to 0.1 for stability
- Convolution works best for memory compression
Further Questions
Two open questions: why do Compressive Transformers improve rare word modeling, and could we explore adaptive compression rates or using RNNs as compressors?