NLP¶
Applications: word embedding (the basis of all other NLP tasks), and the downstream tasks:
- Sentence Pair Classification
- Single Sentence Classification
- Question Answering
- Single Sentence Tagging
word2vec¶
Efficient Estimation of Word Representations in Vector Space (ICLR 2013)
An encoder that embeds words into a lower-dimensional space, trained with two tasks:
- Skip-gram: given the word, predict the context (before / after the input)
- Continuous Bag of Words (CBOW): given the surrounding words, predict the word in the middle
Input: one-hot encoding of the word
Output: Huffman-tree encoding (a search-efficient variant of one-hot, used by hierarchical softmax)
Use the hidden layer as the embedding for other tasks (transfer learning).
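A minimal skip-gram sketch to make the encoder concrete (assuming PyTorch; `vocab_size`, `embed_dim`, and the word ids are illustrative, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 10_000, 300  # hypothetical sizes

class SkipGram(nn.Module):
    def __init__(self):
        super().__init__()
        # The "hidden layer": maps a one-hot word id to a dense vector.
        # This weight matrix is the embedding reused for transfer learning.
        self.in_embed = nn.Embedding(vocab_size, embed_dim)
        # Output projection back onto the vocabulary (softmax over context words).
        self.out = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, center_ids):     # center_ids: (batch,)
        h = self.in_embed(center_ids)  # (batch, embed_dim): the embedding
        return self.out(h)             # logits over the vocab for context words

model = SkipGram()
logits = model(torch.tensor([42]))                 # predict the context of word 42
loss = F.cross_entropy(logits, torch.tensor([7]))  # one observed context word
```

The full-vocabulary softmax here is exactly what Hierarchical Softmax and Negative Sampling (below) replace to cut training cost.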
Distributed Representations of Words and Phrases and their Compositionality (NIPS 2013)
- subsampling of the frequent words
- Negative Sampling (an alternative to Hierarchical Softmax); see the sketch after this list
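A sketch of the negative-sampling objective (the function name and shapes are my own; PyTorch assumed). For each true (center, context) pair, k noise words are sampled and scored against the center vector, so no full-vocabulary softmax is needed:

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(center_vec, context_vec, noise_vecs):
    """center_vec: (d,), context_vec: (d,) for the true pair;
    noise_vecs: (k, d), sampled from the unigram^(3/4) noise distribution."""
    pos = F.logsigmoid(context_vec @ center_vec)           # push the true pair together
    neg = F.logsigmoid(-(noise_vecs @ center_vec)).sum()   # push k noise pairs apart
    return -(pos + neg)                                    # negative log-likelihood

d, k = 300, 5
loss = neg_sampling_loss(torch.randn(d), torch.randn(d), torch.randn(k, d))
```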
Google Code archive | word2vec Parameter Learning Explained | The Illustrated Word2vec
GloVe¶
Global Vectors for Word Representation | Project page
ELMo¶
Embeddings from Language Models
Deep contextualized word representations (NAACL 2018) | AllenNLP code
Built on a deep bidirectional LSTM (bi-LSTM) language model.
Contextual/context-sensitive features: instead of a fixed vector per word, generate a representation of each word based on the other words in the sentence.
The contextual representation of each token is the concatenation of the left-to-right and right-to-left (each a single-direction LM) representations.
ELMo then collapses all layers in the representation set R into a single vector via a task-specific weighted sum, as in the formula below.
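Under my reading of the paper, the collapse over the L + 1 layer representations is:

$$
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\,\mathbf{h}_{k,j}^{LM},
\qquad
\mathbf{h}_{k,j}^{LM} = \big[\overrightarrow{\mathbf{h}}_{k,j}^{LM};\ \overleftarrow{\mathbf{h}}_{k,j}^{LM}\big]
$$

where $s^{task}$ are softmax-normalized layer weights and $\gamma^{task}$ is a learned scalar.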
GPT¶
Improving Language Understanding by Generative Pre-Training (2018) | OpenAI blog
Uses a Transformer decoder; unidirectional (left-to-right), as in the causal-mask sketch below.
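"Unidirectional" means a causal attention mask: token t attends only to positions at or before t. A tiny sketch (PyTorch assumed; sizes illustrative):

```python
import torch

T = 5                                                  # illustrative sequence length
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # row t attends to columns <= t
scores = torch.randn(T, T)                             # stand-in attention scores
scores = scores.masked_fill(~mask, float("-inf"))      # block attention to the future
weights = scores.softmax(dim=-1)                       # rows stay valid distributions
```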
GPT-3¶
Language Models are Few-Shot Learners - OpenAI
It is not open source, and the computation cost is high even for inference; OpenAI provides an API instead.
Image GPT¶
BERT¶
Bidirectional Encoder Representations from Transformers
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) | Google AI Blog | Reddit
The architecture is almost identical to the original Transformer from Attention Is All You Need.
Two pre-training tasks:
- Masked LM: CBOW cannot be used to train a multi-layer bidirectional model because each word would indirectly “see itself”. 15% of tokens are selected; of those, 80% are replaced with the [MASK] token, 10% with a random word, and 10% left unchanged (see the masking sketch after this list).
- Next Sentence Prediction: classify whether sentence B actually follows sentence A in the corpus.
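A sketch of the 80/10/10 masking rule above (PyTorch assumed; `mask_id`, `vocab_size`, and the -100 ignore-index are illustrative defaults, not taken from the paper):

```python
import torch

def mask_tokens(input_ids, mask_id=103, vocab_size=30_522):
    ids = input_ids.clone()
    selected = torch.rand(ids.shape) < 0.15          # pick 15% of tokens as targets
    roll = torch.rand(ids.shape)
    ids[selected & (roll < 0.8)] = mask_id           # 80% -> [MASK]
    swap = selected & (roll >= 0.8) & (roll < 0.9)   # 10% -> random word
    ids[swap] = torch.randint(vocab_size, ids.shape)[swap]
    # The remaining 10% of selected tokens stay unchanged; the model must
    # still predict the original id at every selected position.
    labels = torch.where(selected, input_ids, torch.full_like(input_ids, -100))
    return ids, labels

ids, labels = mask_tokens(torch.randint(30_522, (1, 16)))
```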
Multi-Head Attention¶
Attention heads in the same layer could be shared to reduce computational cost.
What Does BERT Look At? An Analysis of BERT’s Attention (BlackBoxNLP 2019) - analyzes the attention heads
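A minimal multi-head self-attention sketch (PyTorch assumed; BERT-base-like sizes for illustration), showing how a layer's heads run the same scaled dot-product attention in parallel subspaces:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)     # output projection

    def forward(self, x):                           # x: (batch, seq, d_model)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split each projection into heads: (B, h, T, dk).
        q, k, v = (t.view(B, T, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / self.dk ** 0.5  # scaled dot-product scores
        att = att.softmax(dim=-1)                         # per-head attention weights
        out = (att @ v).transpose(1, 2).reshape(B, T, D)  # merge heads back
        return self.proj(out)

y = MultiHeadSelfAttention()(torch.randn(2, 8, 768))      # -> (2, 8, 768)
```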