About BERT

BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations that, at its release in 2018, obtained state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
The Google research paper that proposes BERT is available here (https://arxiv.org/pdf/1810.04805.pdf). It is a must-read.

How is it trained?

The model is pre-trained using two unsupervised prediction tasks, described below.

The first task is masked language modeling. BERT uses a simple approach for this: mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words. For example:

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
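
The masking step can be sketched in a few lines of Python. This is a simplified illustration, not the code from the BERT repository: the function name mask_tokens is hypothetical, the input is split on whitespace instead of into WordPiece tokens, and every selected position is replaced by [MASK] (the paper replaces only 80% of them with [MASK], swaps 10% for a random token and leaves 10% unchanged).

```python
import random

MASK_RATE = 0.15  # fraction of tokens to mask, as stated in the text


def mask_tokens(tokens, mask_rate=MASK_RATE, seed=None):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    rng = random.Random(seed)
    # Mask at least one token so that short inputs still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]  # the model must predict this token
        masked[pos] = "[MASK]"
    return masked, labels


tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens, seed=0)
print(" ".join(masked))
print(labels)
```

Only the positions stored in labels contribute to the training loss; the other tokens serve as context on both sides of each mask.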

The second task is next sentence prediction. To learn relationships between sentences, BERT is also trained on a simple task that can be generated from any monolingual corpus: given two sentences A and B, is B the actual next sentence that comes after A, or just a random sentence from the corpus?

Sentence A: the man went to the store.
Sentence B: he bought a gallon of milk.
Label: IsNextSentence

Sentence A: the man went to the store.
Sentence B: penguins are flightless.
Label: NotNextSentence
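
Generating such pairs from a corpus could look roughly like the sketch below. The function name make_nsp_pairs and the tiny corpus are made up for illustration. The 50/50 split between true and random successors follows the BERT paper and is not stated in the text above; the original implementation also draws the random sentence from a different document, which this sketch does not model.

```python
import random


def make_nsp_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, label) examples from ordered sentences."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        sentence_a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B really follows A in the corpus.
            examples.append((sentence_a, sentences[i + 1], "IsNextSentence"))
        else:
            # Negative example: any sentence except A and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentence_a, rng.choice(candidates), "NotNextSentence"))
    return examples


corpus = [
    "the man went to the store.",
    "he bought a gallon of milk.",
    "penguins are flightless.",
    "they live in the southern hemisphere.",
]
for example in make_nsp_pairs(corpus, seed=0):
    print(example)
```

The corpus needs at least three sentences so that a negative candidate always exists.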


Two pre-trained models are available, differing in the scale of the model architecture: BASE and LARGE.

BERT BASE
Number of layers = 12
Hidden size = 768
Number of attention heads = 12
Total parameters = 110M

BERT LARGE
Number of layers = 24
Hidden size = 1024
Number of attention heads = 16
Total parameters = 340M
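
The parameter totals listed above can be roughly reproduced from the layer count and hidden size alone. The sketch below is a back-of-the-envelope count, not an official formula. It assumes values that do not appear in the text: a vocabulary of 30,522 WordPiece tokens, 512 positions, 2 segment types, and a feed-forward layer four times the hidden size. The function name bert_param_count is hypothetical.

```python
def bert_param_count(num_layers, hidden_size, vocab_size=30522,
                     max_positions=512, type_vocab_size=2):
    """Estimate the number of weights in a BERT encoder (no task head)."""
    h = hidden_size
    ffn = 4 * h  # assumed feed-forward (intermediate) size
    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + 2 * h
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (h * h + h)
    # Two dense layers: h -> ffn and ffn -> h, each with a bias.
    feed_forward = (h * ffn + ffn) + (ffn * h + h)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * h)
    per_layer = attention + feed_forward + layer_norms
    # Dense layer applied to the first token's output.
    pooler = h * h + h
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden_size=768)
large = bert_param_count(num_layers=24, hidden_size=1024)
print(f"BASE:  {base:,}")   # roughly 110M
print(f"LARGE: {large:,}")  # roughly 340M
```

The number of attention heads does not appear in the count: the heads split the hidden size between them, so changing their number leaves the number of weights unchanged.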

The TensorFlow code and pre-trained models for BERT are available on GitHub.

Two important notes

The multilingual model does not perform well, so focus on a single language.
Use the BASE model, since the LARGE model is very demanding even for the most powerful GPUs.
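
To get a feeling for the GPU note, here is a rough estimate of the memory needed just to hold the training state. It assumes 32-bit floats and an Adam-style optimizer that keeps the weights, the gradients and two moment estimates per parameter. These are assumptions of the sketch, and the function name training_state_gib is hypothetical. Activations are not included; they grow with batch size and sequence length, so real usage is higher.

```python
BYTES_PER_FLOAT32 = 4


def training_state_gib(num_params, copies=4):
    """Memory in GiB for `copies` float32 values per parameter.

    copies=4 covers weights, gradients and two Adam moment estimates.
    copies=1 covers the weights only (inference).
    """
    return num_params * BYTES_PER_FLOAT32 * copies / 2**30


for name, num_params in [("BASE", 110_000_000), ("LARGE", 340_000_000)]:
    weights = training_state_gib(num_params, copies=1)
    training = training_state_gib(num_params)
    print(f"{name}: weights {weights:.2f} GiB, training state {training:.2f} GiB")
```

The training state of the LARGE model is about three times that of the BASE model before a single activation is stored.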



©2023 Mobile & Wearables // Kenniscentrum // ERASMUSHOGESCHOOL Brussel
