This paper describes a new training approach for Transformer network architectures used for language modeling tasks. The authors demonstrate that their technique results in greatly improved training efficiency and better performance on common benchmark datasets (GLUE, SQuAD) compared to other state-of-the-art NLP models of similar size.
What can we learn from this paper?
That training a Transformer network as a discriminator using the suggested method appears superior to BERT’s masked language modeling approach.
Prerequisites (to understand the paper, what does one need to be familiar with?)
- Natural Language Processing
- Basic concepts of Generative Adversarial Networks (GANs)
- Transformer networks
The goal of the paper is to improve the training efficiency of state-of-the-art Transformer networks.
One of the main advantages of modern NLP models stems from their ability to be trained on unlabeled datasets. In language modeling, unlabeled data is plentiful, while labeling is expensive and may cover only a limited subset of the situations encountered in real-world language. However, this raises the question of how unlabeled data can be used to train a neural network under a supervised learning framework.
In language modeling, we are interested in the structure of the data itself, which is generated under an expansive but fairly rigid set of rules of grammar and syntax. (Contrast this with, say, visual classification tasks, where the relationship between the arrangement of pixels in images and the desired labels is much less clearly defined.) A natural way to generate training examples from unlabeled language data is therefore to remove some of the words from sentences and have the model predict them in a way that restores the structure. This is the essence of BERT’s Masked Language Modeling (MLM) approach.
While MLM has shown significant success, and BERT and its variations are widely used in modern NLP, replacing words with a [MASK] token during training has its disadvantages. In particular, only a small portion of all words (about 15% in the standard BERT approach) can be masked; removing a much larger proportion would leave too much ambiguity and too wide a range of possible answers. As a result, the model receives a training signal from only about 15% of the words in each example, which is quite inefficient.
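To make the 15% limitation concrete, here is a minimal sketch of MLM-style masking in plain Python. It is an illustration only, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization, and the example sentence are all assumptions, and real BERT applies additional rules (for example, some selected tokens are left unchanged or swapped for random ones rather than masked).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hypothetical helper: hide roughly mask_prob of the tokens.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the label exists only at masked positions
        masked[pos] = MASK
    return masked, targets

sentence = "the chef cooked the meal and served it to the hungry guests quickly".split()
masked, targets = mask_tokens(sentence)

print(" ".join(masked))
print(f"positions with a training target: {len(targets)} of {len(sentence)}")
```

Every position that is not in `targets` contributes nothing to the prediction loss, which is the inefficiency described above.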
Instead of masking words, the paper suggests using a small BERT-like network as a generator that replaces some of the words (also about 15%) with its own predictions. The main network, the discriminator, is then trained to determine which words have been replaced. Both networks are trained together. It is important that the generator is not too good at producing replacements; otherwise, the discriminator’s task becomes impossible. Because the main network has to make a decision about 100% of the words in the text, rather than only the masked ones, training is much more efficient than with BERT.
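The way training examples are built for the discriminator can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the "generator" here is just a random draw from a small hypothetical vocabulary, whereas the real generator is a small masked language model, and the names `corrupt_with_generator` and `vocab` are invented for this example. One detail taken from the ELECTRA paper rather than from this summary: if the generator happens to produce the original word, that position is labeled as original.

```python
import random

def corrupt_with_generator(tokens, vocab, replace_prob=0.15, seed=0):
    """Hypothetical helper: build one replaced-token-detection example.

    Returns the corrupted sequence and one binary label per token
    (1 = replaced, 0 = original).
    """
    rng = random.Random(seed)
    n_replace = max(1, round(len(tokens) * replace_prob))
    positions = rng.sample(range(len(tokens)), n_replace)
    corrupted = list(tokens)
    for pos in positions:
        # Stand-in for the generator: a real one would sample from its
        # own predicted distribution for this position.
        corrupted[pos] = rng.choice(vocab)
    # A sampled token identical to the original counts as "original".
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels

sentence = "the chef cooked the meal and served it to the hungry guests quickly".split()
vocab = ["ate", "painted", "dog", "slowly", "table", "meal", "the"]
corrupted, labels = corrupt_with_generator(sentence, vocab)

for token, label in zip(corrupted, labels):
    print(f"{token:10s} {'replaced' if label else 'original'}")
```

Unlike the masking setup, the label list has one entry per word, so every position in the sentence contributes to the discriminator's training signal.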
The authors discuss various implementation details of ELECTRA (which stands for Efficiently Learning an Encoder that Classifies Token Replacements Accurately) and present a comparison of training time and accuracy on the standard GLUE and SQuAD datasets for different sizes of ELECTRA against various state-of-the-art approaches (BERT, RoBERTa, XLNet, etc.). In all of the reported comparisons, the new approach requires significantly less compute to reach the same accuracy, and when trained further, it achieves higher accuracy than the other models.
Based on the presented results, ELECTRA looks like a very promising new approach and an important addition to modern NLP tools. A GitHub repository is available, including several pre-trained models.