Attention Mechanism

The attention mechanism is mainly used in NLP because of sequence models, but CNNs can also use self-attention, as in SAGAN.

Background

Before RNNsearch, sequence models were mostly implemented as RNN encoder-decoders. An RNN can handle sequential data, but it relies heavily on the preceding data (words, in NLP), which is not flexible enough for language modeling.

seq2seq

Sequence to Sequence Learning with Neural Networks (NIPS 2014)

A multilayered LSTM maps the input sequence to a vector of fixed dimensionality, and another deep LSTM decodes the target sequence from that vector.
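
A minimal PyTorch sketch of this encoder-decoder setup; the class name, layer sizes, and teacher-forced decoder input below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM compresses the source into its final (hidden, cell) state;
    that fixed-size state initializes the decoder LSTM."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, layers, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt):
        # src, tgt: (batch, seq_len) token ids; tgt is the teacher-forced input
        _, state = self.encoder(self.src_emb(src))        # fixed-size summary of src
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)                          # per-step target logits
```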

RNNsearch

Neural Machine Translation by Jointly Learning to Align and Translate (arXiv 2014 / ICLR 2015)

In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
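
A sketch of this (soft-)search as additive, Bahdanau-style attention in PyTorch; the class name, dimensions, and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Score every encoder state against the current decoder state, normalize with
    softmax (soft alignment), and return the weighted sum as the context vector."""
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (B, dec_dim), enc_states: (B, T, enc_dim)
        scores = self.v(torch.tanh(self.W_enc(enc_states)
                                   + self.W_dec(dec_state).unsqueeze(1)))  # (B, T, 1)
        weights = torch.softmax(scores, dim=1)            # alignment over source positions
        context = (weights * enc_states).sum(dim=1)       # (B, enc_dim)
        return context, weights.squeeze(-1)
```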

Show, Attend and Tell

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (ICML 2015)
Generates a caption word by word, attending to different image regions at each decoding step.

Transformer

Attention is all you need (NIPS 2017) - Google
Google AI blog | The Illustrated Transformer | The Annotated Transformer - Harvard
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence. (In the earlier, non-self attention, the decoder attends to the encoder's output; in self-attention, a layer in either the encoder or the decoder attends to other representations within the same sequence.)

Scaled Dot-Product Attention

http://nlp.seas.harvard.edu/images/the-annotated-transformer_33_0.png
The input consists of queries \(Q\) and keys \(K\) of dimension \(d_k\), and values \(V\) of dimension \(d_v\) (the channel dimension of the features).
$$ Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V $$
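
A direct PyTorch translation of the formula; the optional mask argument is an assumption beyond the equation itself (used for padding or causal masking).

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)        # (..., T_q, T_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)                  # attention weights
    return weights @ V, weights
```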

Multi-Head Attention


\[\begin{split}MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O \\ \text{where } head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)\end{split}\]

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
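
A sketch of the projection, head-splitting, and concatenation behind the formula. `d_model=512` and `h=8` follow the paper's base model; the rest of the bookkeeping is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project Q, K, V, split channels into h heads of size d_k = d_model / h,
    attend in each subspace, then concatenate and mix with W^O."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B = q.size(0)
        def split(x, proj):  # (B, T, d_model) -> (B, h, T, d_k)
            return proj(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.W_q), split(k, self.W_k), split(v, self.W_v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # scaled dot-product per head
        out = torch.softmax(scores, dim=-1) @ v              # (B, h, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.W_o(out)                                 # concat heads, apply W^O
```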

Non-local Neural Networks

Non-local Neural Networks (CVPR 2018)
Caffe code
Spacetime non-local block: the response at a position is a weighted sum of the features at all positions, across both space and time.
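
A sketch of the embedded-Gaussian non-local block for the 2D (image) case; the paper's spacetime block uses Conv3d over (T, H, W) instead of Conv2d, and the channel sizes here are illustrative.

```python
import torch
import torch.nn as nn

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block: softmax over pairwise affinities of all
    positions, weighted sum of values, residual connection back to the input."""
    def __init__(self, in_ch, inter_ch=None):
        super().__init__()
        inter_ch = inter_ch or in_ch // 2
        self.theta = nn.Conv2d(in_ch, inter_ch, 1)   # query embedding
        self.phi = nn.Conv2d(in_ch, inter_ch, 1)     # key embedding
        self.g = nn.Conv2d(in_ch, inter_ch, 1)       # value embedding
        self.w_z = nn.Conv2d(inter_ch, in_ch, 1)     # restore channels for the residual

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # affinity of every position to every other
        y = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return x + self.w_z(y)                         # residual connection
```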

CBAM

CBAM: Convolutional Block Attention Module (ECCV 2018)

Channel Attention Module (CAM)

Spatial Attention Module (SAM)
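
A sketch of the two modules applied sequentially to a feature map, as in CBAM. The reduction ratio 16 and the 7x7 spatial kernel follow the paper's defaults; other details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: average- and max-pool the spatial dims, pass both through a shared MLP,
    sum, and squash with a sigmoid to get per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        B, C, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))     # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))      # max-pooled descriptor
        return torch.sigmoid(avg + mx).view(B, C, 1, 1)

class SpatialAttention(nn.Module):
    """SAM: concatenate channel-wise average and max maps, convolve, sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Apply channel attention first, then spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, x):
        x = x * self.cam(x)
        return x * self.sam(x)
```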

DNL

Disentangled Non-Local Neural Networks (ECCV 2020) / Disentangled self-attention models
bilibili explanation (Chinese) | text version | segmentation PyTorch | object detection PyTorch
The traditional self-attention block can be split into two terms: a whitened pairwise term and a unary term. Because the two terms are tightly coupled, they hinder each other's learning. The disentangled non-local block decouples them to facilitate learning of both.

Denote \(q_i = W_q x_i\), \(k_j = W_k x_j\), \(\sigma(\cdot) = softmax(\cdot)\), and let \(\mu_q\), \(\mu_k\) be the means of \(q\) and \(k\) over all positions.
Standard non-local block (adding inside \(\sigma\) is equivalent to multiplying the unnormalized attention maps; the equality holds because terms constant in \(j\) cancel inside the softmax):
\(w(x_i, x_j) = \sigma(q_i^T k_j) = \sigma(\underbrace{(q_i-\mu_q)^T(k_j-\mu_k)}_\text{pairwise} + \underbrace{\mu_q^T k_j}_\text{unary})\)
Disentangled non-local block (the two softmax-normalized terms are added, and the unary term uses a decoupled \(W_m\) instead of \(W_k\)):
\(w^D(x_i, x_j) = \sigma(\underbrace{(q_i-\mu_q)^T(k_j-\mu_k)}_\text{pairwise}) + \sigma(\underbrace{W_m x_j}_\text{unary})\)
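
A sketch of the disentangled weights \(w^D\) on flattened features in PyTorch. The unary branch as a single linear map \(W_m\), the channel sizes, and the flattened (B, N, C) layout are illustrative assumptions; the value/output path and residual of the full block are omitted.

```python
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    """Whitened pairwise softmax plus a decoupled unary softmax (W_m independent
    of W_k), computed for flattened features x of shape (B, N, C)."""
    def __init__(self, channels, inter_ch=None):
        super().__init__()
        inter_ch = inter_ch or channels // 2
        self.W_q = nn.Linear(channels, inter_ch, bias=False)
        self.W_k = nn.Linear(channels, inter_ch, bias=False)
        self.W_m = nn.Linear(channels, 1, bias=False)   # unary branch

    def forward(self, x):
        q = self.W_q(x)                       # (B, N, C')
        k = self.W_k(x)                       # (B, N, C')
        q = q - q.mean(dim=1, keepdim=True)   # whiten: subtract mu_q
        k = k - k.mean(dim=1, keepdim=True)   # whiten: subtract mu_k
        pairwise = torch.softmax(q @ k.transpose(1, 2), dim=-1)     # (B, N, N)
        unary = torch.softmax(self.W_m(x).transpose(1, 2), dim=-1)  # (B, 1, N)
        return pairwise + unary               # w^D(x_i, x_j), broadcast over i
```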