NLP

Applications: 0. word embedding (the basis of all the other NLP tasks below)

  1. Sentence Pair Classification
  2. Single Sentence Classification
  3. Question Answering
  4. Single Sentence Tagging

word2vec

Efficient Estimation of Word Representations in Vector Space (ICLR 2013)
An encoder that embeds words into a lower-dimensional space, trained with 2 tasks:

  • Skip-gram: given the word, predict the context (before / after the input)
  • Continuous Bag of Words (CBOW): given the surrounding words, predict the word in the middle

input: one-hot encoding of the word
output: Huffman-tree encoding (a search-efficient alternative to plain one-hot, used for hierarchical softmax)

Use the hidden layer as the word embedding for other tasks (transfer learning).
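A minimal sketch of the two training tasks and of the hidden-layer embedding lookup, assuming a toy corpus, window size 2, and embedding dimension 8 (all made up here); for simplicity the output side is a plain softmax rather than the paper's Huffman-tree hierarchical softmax:

    import numpy as np

    corpus = "the quick brown fox jumps over the lazy dog".split()  # toy corpus (assumption)
    vocab = sorted(set(corpus))
    word2id = {w: i for i, w in enumerate(vocab)}
    window = 2  # context window size (arbitrary choice)

    # Skip-gram pairs: (center word -> each surrounding context word)
    skipgram_pairs = []
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                skipgram_pairs.append((word2id[w], word2id[corpus[j]]))

    # CBOW pairs: (all surrounding context words -> center word)
    cbow_pairs = []
    for i, w in enumerate(corpus):
        ctx = [word2id[corpus[j]]
               for j in range(max(0, i - window), min(len(corpus), i + window + 1)) if j != i]
        cbow_pairs.append((ctx, word2id[w]))

    # The "encoder": a one-hot input simply selects a row of W_in;
    # that row (the hidden layer) is the word embedding reused downstream.
    dim = 8
    W_in = np.random.randn(len(vocab), dim) * 0.01   # embedding table (hidden-layer weights)
    W_out = np.random.randn(dim, len(vocab)) * 0.01  # output projection to the vocabulary

    center, context = skipgram_pairs[0]
    hidden = W_in[center]                            # equivalent to one_hot(center) @ W_in
    scores = hidden @ W_out                          # logits over the vocabulary
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax: predicted context distribution
    print(vocab[center], "->", vocab[context], probs[context])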

Distributed Representations of Words and Phrases and their Compositionality (NIPS 2013)

  1. Subsampling of frequent words
  2. Negative Sampling (an alternative to hierarchical softmax)
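A rough sketch of both tricks under stated assumptions (toy corpus, typical threshold t = 1e-3, k = 5 negatives): a frequent word is kept with probability sqrt(t / f(w)), and noise words are drawn from the unigram distribution raised to the 3/4 power, as in the NIPS 2013 paper.

    import numpy as np
    from collections import Counter

    corpus = "the cat sat on the mat the dog sat on the rug".split() * 50  # toy corpus (assumption)
    counts = Counter(corpus)
    total = sum(counts.values())
    freq = {w: c / total for w, c in counts.items()}

    # 1. Subsampling of frequent words: keep word w with probability sqrt(t / f(w))
    t = 1e-3  # threshold (typical value from the paper)
    rng = np.random.default_rng(0)
    kept = [w for w in corpus if rng.random() < np.sqrt(t / freq[w])]

    # 2. Negative sampling: draw k noise words from the unigram distribution ** 0.75
    vocab = sorted(counts)
    unigram = np.array([counts[w] for w in vocab], dtype=float) ** 0.75
    unigram /= unigram.sum()

    def negative_samples(k=5):
        """Sample k 'noise' words: the model learns to score them low (logistic loss)
        while scoring the true context word high, instead of a full softmax."""
        return list(rng.choice(vocab, size=k, p=unigram))

    print(len(corpus), "->", len(kept), "tokens after subsampling")
    print("negatives:", negative_samples())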

google code archive | word2vec Parameter Learning Explained | The Illustrated Word2vec

GloVe

GloVe: Global Vectors for Word Representation (EMNLP 2014) | Project page

ELMo

Embeddings from Language Models
Deep contextualized word representations (NAACL 2018) | AllenNLP code
bi-LSTM
Contextual/context-sensitive features: instead of one fixed vector per word type, generate a representation of each word that depends on the other words in the sentence.
ELMo collapses all layers in R (the set of biLM layer representations) into a single vector per token.
The contextual representation of each token is the concatenation of the left-to-right and right-to-left (single-direction) representations.
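A sketch of the two points above with made-up tensor shapes (and ignoring the character-based token layer): each biLM layer yields a left-to-right and a right-to-left state per token, the two are concatenated per layer, and the resulting set R is collapsed into one vector per token via softmax-normalized layer weights and a scalar gamma, as described in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n_layers, seq_len, hidden = 2, 5, 4      # toy sizes (assumptions)

    # Pretend biLM outputs: forward and backward hidden states per layer per token
    fwd = rng.standard_normal((n_layers, seq_len, hidden))
    bwd = rng.standard_normal((n_layers, seq_len, hidden))

    # Contextual representation per layer = concat(left-to-right, right-to-left)
    R = np.concatenate([fwd, bwd], axis=-1)  # shape (n_layers, seq_len, 2*hidden)

    # Collapse all layers in R into a single vector per token:
    # ELMo_k = gamma * sum_j s_j * h_{k,j}, with softmax-normalized layer weights s_j
    layer_logits = rng.standard_normal(n_layers)   # task-specific, learned in practice
    s = np.exp(layer_logits) / np.exp(layer_logits).sum()
    gamma = 1.0                                    # task-specific scalar
    elmo = gamma * np.einsum("j,jkd->kd", s, R)    # shape (seq_len, 2*hidden)
    print(elmo.shape)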

GPT

Improving Language Understanding by Generative Pre-Training (2018) | OpenAI blog
Uses the Transformer (decoder-only); unidirectional (left-to-right).
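"Unidirectional" here means the decoder's self-attention is causally masked, so position i can attend only to positions <= i; a small sketch with toy attention scores (not GPT's actual weights):

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len = 5
    scores = rng.standard_normal((seq_len, seq_len))  # toy attention scores (assumption)

    # Causal mask: block attention to future tokens (left-to-right language modeling)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)

    attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    print(np.round(attn, 2))   # the upper triangle is ~0: no peeking ahead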

GPT-3

Language Models are Few-Shot Learners - OpenAI
It is not open source, and the computational cost is high even for inference; OpenAI provides an API instead.
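"Few-shot" means conditioning on a handful of in-context examples placed in the prompt at inference time, with no gradient updates. A hypothetical prompt layout (the translation task and examples are illustrative; no actual API call is shown):

    # Hypothetical few-shot prompt: the model infers the task from the in-context
    # examples and is expected to complete the final line. No weights are updated.
    few_shot_prompt = """Translate English to French:
    sea otter => loutre de mer
    peppermint => menthe poivrée
    cheese =>"""
    # This string would be sent to the hosted model via the API; the completion
    # (ideally "fromage") comes back as generated text.
    print(few_shot_prompt)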

Image GPT

OpenAI blog

BERT

Bidirectional Encoder Representations from Transformers
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) | Google AI Blog | Reddit

The architecture is almost identical to the original "Attention Is All You Need" Transformer (encoder only).
2 training tasks:

  1. Masked LM: CBOW cannot be used to train a multi-layer bidirectional model because each word would indirectly "see itself". 15% of tokens are selected for prediction; of these, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are left unchanged.
  2. Next Sentence Prediction
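A sketch of the masked-LM corruption rule above, with a made-up sentence and toy vocabulary: 15% of tokens become prediction targets, and of those 80% are replaced with [MASK], 10% with a random word, and 10% are left as-is.

    import random

    random.seed(1)
    vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "[MASK]"]
    tokens = "the cat sat on the mat".split()   # toy example sentence (assumption)

    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < 0.15:              # 15% of tokens are prediction targets
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")         # 80%: replace with the mask token
            elif r < 0.9:
                masked.append(random.choice(vocab[:-1]))  # 10%: replace with a random word
            else:
                masked.append(tok)              # 10%: keep unchanged (but still predict it)
        else:
            masked.append(tok)

    print(masked, targets)   # the model is trained to predict `targets` from `masked`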

Multi-Head Attention

Attention within the same layer could be shared to reduce computational cost. What Does BERT Look At? An Analysis of BERT's Attention (BlackBoxNLP 2019) - analyses the attention heads.
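For reference, a compact sketch of standard multi-head self-attention as in "Attention Is All You Need" (random weights and toy dimensions, not BERT's trained parameters):

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 4, 16, 4
    d_head = d_model // n_heads

    x = rng.standard_normal((seq_len, d_model))           # token representations (toy input)
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

    def split_heads(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # scaled dot-product per head
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    heads = attn @ V                                      # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ Wo
    print(out.shape)                                      # (seq_len, d_model)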