By Manzil Zaheer, Guru Guruganesh, Avinava Dubey et al (Google Research), 2020
In this paper, the authors present a Transformer attention model with linear complexity that is mathematically proven to be Turing complete (and thus as powerful as the original quadratic attention model) and achieves new state-of-the-art results on many NLP tasks involving long sequences (e.g. question answering and summarization), as well as genomics data.
What can we learn from this paper?
That Transformer models can be effectively applied to extra-long sequences using this new attention approach.
Prerequisites (to better understand the paper, what should one be familiar with?)
- Neural attention and Transformers
- Matrix algebra, graph theory and functional analysis (only if full understanding of the new model’s mathematical foundation is desired).
While Transformer models have been dominating the state of the art in NLP over the last several years, their inefficiency with long sequences, stemming from the need to calculate O(n2) key and query attention product, is well known. There have been multiple attempts to simplify this calculation and thus enable the full use of these networks for longer texts, including the recent Linformer model from Facebook AI, which has proven that the attention matrices are effectively low rank and can be approximated by smaller ones, which in that model is achieved by using neural linear projection layers, as well as Longformer from Allen institute for AI, which used a combination of window attention (attending to the close neighbors of each token) and global attention (having a few global tokens that attend to every other token in the sequence).
This paper is Google’s answer to the same problem, and achieves the reduction in matrix dimension and computational complexity by calculating attention, in addition to the window attention and global attention as in the Longformer, for a fixed number of random tokens (see the picture below).
A large portion of the paper and its extra long Appendix is devoted to mathematically proving that the new model (BigBird) is as expressive (Turing complete) as the original Transformer model. The paper discusses some limitations of the BigBird approach, or any O(n) approach for that matter, to show that in some specifically constructed cases this leads to needing n layers to perform the same task that could be performed with one layer in the original model, that is, there is no free lunch. However, as in the case of the general No Free Lunch Theorem, this is just a theoretical result that shouldn’t have much practical significance as long as the model performs better on the actual data of interest.
The global tokens in the model can either be taken from the existing tokens in the sequence (the BigBird-ITC version) or added to the inputs similarly to CLS tokens in classic transformers (the BigBird-ETC version), thus providing extra space for global context, and, as can be seen from the results, usually improving the performance of the model.
The new model was pre-trained on 4 standard NLP corpora: Books, CC-News, Stories, and Wikipedia, using masked language modeling (MLM), that is, masking about 15% of the input tokens at random and then trying to predict them. The hyperparameters of each BigBird version and other training details are given in Appendix E of the paper.
Curiously, in Tables 12, 14, and 15 the number of random tokens is indicated as 0 for the ETC version, which essentially makes the model very similar to the Longformer (with possible differences in implementation). Furthermore, this seems to contradict the statement at the bottom of page 40, which is “We only use gather for the random attention, which is very small (r = 3 for all of our experiments)”. If it was in fact r = 0 for many of the tasks, it would be interesting to know why this was done only for the ETC version and how the authors of the paper would explain the superior performance of BigBird over Longformer in these cases, given the same overall approach. In general, it is understandable that the authors wanted to publish this as soon as possible given the competition and the speed of development in this field, but, probably as a result of that, the paper seems a bit rushed (including quite a few typos in the current version).
The reported performance of BigBird on various benchmark tasks of question answering and especially summarization (chosen since these often require working with longer sequences) is quite impressive, setting a number of new state-of-the-art results. Interestingly, the authors also applied BigBird to DNA sequencing, where it also performed better than most of the previous models.
Overall, this seems like an important paper in the development of better, more efficient Transformer models. Hopefully, the relative importance of BigBird’s random attention component will be clarified in future works.
P.S. (added August 8, 2020)
After this review was published, Manzil, the first author of the paper, reached out to me to clarify some of the issues I mentioned, which was very nice of him. Here is my understanding of the gist of what he said:
1) The main reason why BigBird performs better than its competition (e.g. Longformer) is due to its more elaborate design, not new features. In particular, the new random attention used in the ITC version, while also performing slightly better than Longformer due to overall design, is not as effective in most cases as the explicitly-designed global attention in ETC.
2) Adding random attention (r > 0) to the ETC model didn’t result in a significant increase in performance, suggesting that the external attention tokens are sufficient for global attention. For this reason, random attention was not used in ETC.
3) The ETC version performs better for datasets that have more structure. For less structured data (e.g. TriviaQA), the external global attention of ETC does not significantly help (in fact, according to Table 4 of the paper, the ITC version performed better on this one for the base size models).