About BERT
BERT is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
The research paper from Google that proposes BERT can be found at https://arxiv.org/pdf/1810.04805.pdf. It is a must-read.
How is it trained?
The model is pre-trained using two novel unsupervised prediction tasks described below:
The first task is Masked Language Modeling (Masked LM). BERT uses a simple approach for it: mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
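To make the idea concrete, here is a minimal Python sketch of how such masked examples could be built. It is a simplification, not the official data pipeline: the released code selects 15% of token positions and then uses [MASK] for only 80% of them, substituting a random word or keeping the original token for the rest.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Hide roughly `mask_prob` of the tokens and record the originals as labels."""
    masked = list(tokens)
    labels = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = token       # the model must predict this original token
            masked[i] = MASK_TOKEN  # the encoder input only sees [MASK]
    return masked, labels

tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens)
print(masked)  # e.g. ['the', 'man', '[MASK]', 'to', 'the', 'store', ...]
print(labels)  # e.g. {2: 'went'} - the positions BERT is trained to predict
```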
The second task is Next Sentence Prediction (NSP). In order to learn relationships between sentences, BERT is also trained on a simple task that can be generated from any monolingual corpus: given two sentences A and B, is B the actual sentence that follows A, or just a random sentence from the corpus?
Sentence A: the man went to the store.
Sentence B: he bought a gallon of milk.
Label: IsNextSentence
Sentence A: the man went to the store.
Sentence B: penguins are flightless.
Label: NotNextSentence
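A rough Python sketch of how such sentence pairs could be generated from any monolingual corpus (illustrative only; the official create_pretraining_data.py script works on longer multi-sentence segments, but the labelling idea is the same):

```python
import random

def make_nsp_example(sentences, index, rng=random):
    """Pair sentence `index` with its true successor half of the time
    (IsNextSentence), otherwise with a random other sentence (NotNextSentence)."""
    sentence_a = sentences[index]
    if index + 1 < len(sentences) and rng.random() < 0.5:
        return sentence_a, sentences[index + 1], "IsNextSentence"
    # Otherwise pick a random sentence that is not the true successor.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), "NotNextSentence"

corpus = [
    "the man went to the store.",
    "he bought a gallon of milk.",
    "penguins are flightless.",
]
print(make_nsp_example(corpus, 0))
# e.g. ('the man went to the store.', 'penguins are flightless.', 'NotNextSentence')
```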
Architecture
Two pre-trained models are released, differing only in the scale of the architecture: BASE and LARGE.
BERT BASE:
Number of layers = 12
Hidden size = 768
Number of attention heads = 12
Total parameters = 110M
BERT LARGE:
Number of layers = 24
Hidden size = 1024
Number of attention heads = 16
Total parameters = 340M
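As a sanity check on these figures, a back-of-the-envelope estimate in Python (ignoring bias and LayerNorm parameters, and assuming the 30,522-token English WordPiece vocabulary and 512 position embeddings of the released checkpoints) lands close to the advertised totals:

```python
def approx_bert_params(num_layers, hidden_size, vocab_size=30522, max_positions=512):
    """Rough parameter count for a BERT-style encoder (biases/LayerNorm ignored)."""
    # Per layer: 4*h*h for the attention projections (Q, K, V, output)
    # plus 8*h*h for the feed-forward block (h -> 4h -> h).
    per_layer = 12 * hidden_size * hidden_size
    # Token, position and segment embeddings, plus the pooler projection.
    embeddings = (vocab_size + max_positions + 2) * hidden_size
    pooler = hidden_size * hidden_size
    return num_layers * per_layer + embeddings + pooler

print(f"BASE:  ~{approx_bert_params(12, 768) / 1e6:.0f}M")   # ~109M (advertised 110M)
print(f"LARGE: ~{approx_bert_params(24, 1024) / 1e6:.0f}M")  # ~335M (advertised 340M)
```

Note that the number of attention heads does not change the parameter count: the heads simply split the same hidden size into smaller pieces.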
The TensorFlow code and pre-trained models for BERT are available in the GitHub repository below:
https://github.com/google-research/bert
2 important notes
The multilingual model does not perform as well as the single-language models, so prefer a model trained on your target language.
Use the BASE model, since the LARGE model demands a great deal of memory from even the strongest GPUs.