What's New in XLNet?

R.I.P BERT

BERT got a head shot yesterday, by another guy called XLNet. It is reported that XLNet defeated BERT on 20 NLP tasks, and achieved 18 new state-of-the-art results. Isn’t it impressive? So, farewell, BERT.

Is BERT really dead?

Since I love BERT, I decided to read the paper to find out what killed him. While reading, I kept thinking: wait a minute, is BERT really dead? After finishing the paper, I was so glad to know that BERT is still alive and well! He is just wearing another coat named Two-Stream Self-Attention (TSSA), with some other gadgets! Because:
XLNet = BERT + TSSA + bidirectional data input
BERT, you’re so tough, buddy!

Let’s take a closer look at what was trying to kill BERT.

Two-stream self-attention (TSSA)

Why is TSSA needed to kill BERT? Well, let’s first look at some of BERT’s weaknesses.

BERT uses a masked language model (MLM) training objective, which is essentially how it achieves bidirectional representations.

Image source

In this example, both words “store” and “gallon” are intended to be predicted by BERT, and their input word embeddings are replaced by the embedding of a special token [MASK]. Usually this isn’t a problem, but what if the prediction of “store” requires knowing the word “gallon”? That is exactly where BERT falls short: each masked word is predicted from the unmasked words only, independently of the other masked words.
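To make the weakness concrete, here is a minimal sketch of MLM-style masking in plain Python. The example sentence is my assumption (the original figure is not reproduced here), and `mask_tokens` is a hypothetical helper, not BERT's actual preprocessing code.

```python
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Replace the tokens at `positions` with [MASK]; return (inputs, targets)."""
    positions = set(positions)
    # The model only ever sees `inputs`; the original words become targets.
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return inputs, targets

# Assumed example sentence containing both "store" and "gallon".
tokens = "the man went to the store . he bought a gallon of milk".split()
inputs, targets = mask_tokens(tokens, positions=[5, 10])

print(" ".join(inputs))
print(targets)
```

Note that when the model predicts position 5 (“store”), the input at position 10 is just [MASK], so the word “gallon” is simply not available to it, and vice versa.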

TSSA is what you can use to overcome that downside of MLM:

Query stream, source

In this illustration, the query stream gives you the query vector needed for the attention calculation. This stream is designed so that it doesn’t leak the content of the word it’s going to predict, yet it still has access to the information from the other positions. Take $x_1$ for example: $x_1$'s embedding (and hidden state) is not used at all, but the embeddings and hidden states from the other positions are used in each layer.

Content stream, source

The content stream, on the other hand, gives you the key and value vectors needed for the context vector calculation. This stream uses a strategy similar to that of a standard Transformer decoder: it masks future positions. The only difference is that in the content stream, the order of tokens is randomly permuted. For example, in the permuted order $x_2$ comes right after $x_3$, and therefore $h_2^{(1)}$ can only see the embedding of itself and that of $x_3$ (and $mem^{(0)}$), but not that of $x_1$ or $x_4$.
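The two streams differ only in who is allowed to attend to whom, which can be sketched as a pair of boolean attention masks. This is a rough illustration, not XLNet's implementation: `two_stream_masks` is a hypothetical helper, the factorization order 3 → 2 → 4 → 1 is an assumption chosen to match the examples above, and the memory $mem^{(0)}$ is left out.

```python
def two_stream_masks(order):
    """Build attention masks for a factorization order of 1-based token ids.

    mask[i][j] is True when token i+1 may attend to token j+1.
    """
    n = len(order)
    # rank[t] = step at which token t appears in the factorization order
    rank = {token: step for step, token in enumerate(order)}
    tokens = range(1, n + 1)
    # Content stream: a token sees itself and everything earlier in the order.
    content = [[rank[j] <= rank[i] for j in tokens] for i in tokens]
    # Query stream: a token sees only what is strictly earlier, never itself.
    query = [[rank[j] < rank[i] for j in tokens] for i in tokens]
    return content, query

# Assumed order 3 -> 2 -> 4 -> 1, consistent with the examples in the text.
content, query = two_stream_masks([3, 2, 4, 1])

print(content[1])  # row for x_2 in the content stream
print(query[0])    # row for x_1 in the query stream
```

With this order, the content-stream row for $x_2$ allows only $x_2$ and $x_3$, and the query-stream row for $x_1$ allows every position except $x_1$ itself.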

Mask a span

Another difference from BERT is masking a span of consecutive words. The reason, I guess, is that this guarantees dependence among the masked words (which is claimed to be what BERT can’t model). This is not a brand-new idea, though. Recently, two ERNIE papers (BERT-based) proposed masking named entities (often spanning multiple words, paper link) and/or phrases (paper link).
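As a rough sketch of what masking a span means (as opposed to masking scattered words), the snippet below masks one run of consecutive tokens. `mask_span`, the example sentence, and the span length are all assumptions for illustration; this is not how XLNet or ERNIE actually select spans.

```python
import random

MASK = "[MASK]"

def mask_span(tokens, span_len, rng=random):
    """Mask one run of `span_len` consecutive tokens.

    Returns (inputs, start, targets), where `targets` are the hidden words.
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    # Pick a start index so the whole span fits inside the sentence.
    start = rng.randrange(len(tokens) - span_len + 1)
    inputs = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    targets = tokens[start:start + span_len]
    return inputs, start, targets

# Assumed example sentence; a seeded generator keeps the run reproducible.
tokens = "he bought a gallon of milk at the store".split()
inputs, start, targets = mask_span(tokens, span_len=3, rng=random.Random(0))

print(" ".join(inputs))
print(start, targets)
```

Because the hidden words are neighbours, predicting them one after another (rather than independently) is where the dependence between masked words can come into play.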

Bidirectional data input

Another notable difference in XLNet is the use of bidirectional data input. The idea (I guess) is to decide the factorization direction (either forward or backward), so that the “masking future positions” trick used in a standard Transformer decoder can easily be applied in XLNet as well.

Masking a span makes XLNet look like a denoising autoencoder; but by using bidirectional data input (or masking future positions), XLNet behaves more like an autoregressive language model in the masked region.
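Under that reading (which, again, is my guess and not necessarily what the XLNet code does), the direction simply decides which side counts as “future”. A minimal sketch, with `future_mask` as a hypothetical helper:

```python
def future_mask(n, direction="forward"):
    """mask[i][j] is True when position i may attend to position j."""
    if direction == "forward":
        # Left-to-right: the future is everything to the right of i.
        return [[j <= i for j in range(n)] for i in range(n)]
    if direction == "backward":
        # Right-to-left: the future is everything to the left of i.
        return [[j >= i for j in range(n)] for i in range(n)]
    raise ValueError("direction must be 'forward' or 'backward'")

forward = future_mask(3, "forward")
backward = future_mask(3, "backward")

for row in forward:
    print(row)
for row in backward:
    print(row)
```

The forward mask is lower triangular and the backward mask is upper triangular, so the same “mask the future” machinery works in either direction.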

Closing remarks

So now you can probably see the similarities and differences between XLNet and BERT. If not, here is a quick summary:

  • Instead of masking random words, mask a span of words
  • Use bidirectional data input to decide which direction you treat as “future”, and then apply the idea of masking future positions
  • To avoid leaking the information of the position to be predicted, use Two-Stream Self-Attention (TSSA)
  • Other minor things like segment recurrence, relative positional encoding, etc.

However, these don’t seem like enough changes to account for all those improvements. What if BERT were also trained on the additional data (Giga5, ClueWeb, Common Crawl)? Would XLNet still be able to defeat BERT?

EDIT:

  • Another model named MASS employs a very similar idea.
  • According to Jacob Devlin (author of BERT), relative positional embedding might be of great importance.
PhD student at ILPS

Opinions are mine.