Back to Blog
Jun 20, 2019 < 1 min read Paper Notes
What's New in XLNet?
Examining how XLNet improved upon BERT's architecture with Two-Stream Self-Attention and bidirectional data input.
NLP Deep Learning BERT Transformer XLNet
Let me break down what makes XLNet different from BERT.
Main Formula: XLNet = BERT + TSSA + bidirectional data input
Two-Stream Self-Attention (TSSA)
TSSA addresses BERT’s limitation with masked language modeling. When BERT masks tokens, it cannot learn dependencies between masked words. TSSA solves this through:
- Query stream: Provides attention query vectors without leaking information about the target word being predicted
- Content stream: Supplies key/value vectors using randomly permuted token order, similar to Transformer decoder masking
Additional Innovations
- Masking consecutive word spans rather than random individual words
- Using bidirectional data input to determine “future” direction
- Implementing relative positional encoding and segment recurrence
Closing Observation
One question remains: do the architectural improvements alone account for XLNet’s performance gains, or does additional training data (Giga5, ClueWeb, Common Crawl) significantly contribute?
BERT remains viable; XLNet simply represents architectural evolution rather than obsolescence.