By Sinong Wang, Belinda Z. Li, Madian Khabsa et al (Facebook AI Research), 2020
This paper suggests an approximate way of calculating self-attention in Transformer architectures that has linear space and time complexity in terms of the sequence length, with the resulting performance on benchmark datasets similar to that of the RoBERTa model based on the original Transformers with much less efficient quadratic attention complexity.
What can we learn from this paper?
That the full attention matrix can be efficiently approximated by a much smaller one by projecting the key and value matrices into lower fixed-sized dimensions.
Prerequisites (to understand the paper, what does one need to be familiar with?)
- Matrix algebra
- Neural attention and Transformers
Due to the quadratic space and time complexity of the attention dot product in standard Transformer models, training and deploying these models for long sequences is very expensive and/or time-consuming. Recently, there have been many attempts to improve the efficiency of Transformers’ attention mechanism, such as, for example, by using locality-sensitive hashing in Google’s 2020 Reformer model. However, all of these attempts resulted only in moderate gains in efficiency, often at the cost of reduced performance.
In this paper, the authors show both theoretically and empirically that the attention dot product matrix can always be closely approximated by a matrix of much lower rank, which can be obtained by calculating the singular value decomposition (SVD) of the key and value matrices and only using a fixed number of the largest singular values. Since SVD calculation also adds computational complexity, the authors replace it with a linear projection layer (as shown in the picture) that serves to reduce the dimensionality. For simplicity, the trainable projection layer parameters can be shared between attention heads and/or between layers.
It is shown that the performance of both the pre-trained and fine-tuned Linformer models is similar to and sometimes exceeding that of RoBERTa, a popular state-of-the-art BERT variant. At the same time, the inference time speed-up and memory savings can be 20-60 fold for longer sequences.
While the suggested approach still required training on 64 Tesla V100 GPUs, it certainly seems more efficient and thus very promising for using the pre-trained models in downstream applications. Also, as the authors point out, the ability to train on longer sequences creates the potential of training Transformer models on images more easily.