Normalization

Note: normalization is usually applied before the activation function.

Local Response Normalization

LRN from AlexNet

\[b^i_{x,y}=\frac{a^i_{x,y}}{\Big(k+\frac{\alpha}{n}\sum^{\min(N-1,\ i+n/2)}_{c=\max(0,\ i-n/2)} (a^c_{x,y})^2 \Big)^\beta}\]

The sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The ordering of the kernel maps is arbitrary and fixed before training begins. This response normalization implements a form of lateral inhibition, inspired by real neurons, creating competition for large activations among neuron outputs computed with different kernels.
Later work claims that LRN is not actually useful.
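A minimal sketch of the formula above, assuming NCHW tensors and illustrative hyperparameters (torch.nn.LocalResponseNorm provides a built-in equivalent):

```python
import torch

def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    # a: (batch, N_channels, H, W); sum squared activations over a window of n adjacent channels
    N = a.size(1)
    sq = a.pow(2)
    out = torch.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)   # inclusive channel window around i
        denom = (k + (alpha / n) * sq[:, lo:hi + 1].sum(dim=1)).pow(beta)
        out[:, i] = a[:, i] / denom
    return out

x = torch.randn(2, 16, 8, 8)
print(lrn(x).shape)  # torch.Size([2, 16, 8, 8])
```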

Pixelwise Feature Vector Normalization (ICLR 2018)

From Progressive GAN; a variant of LRN applied over all channels, i.e. n = N.

\[b_{x,y}=\frac{a_{x,y}}{\sqrt{\frac1{N}\sum^{N-1}_{j=0}(a^j_{x,y})^2+\epsilon}}\]
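A minimal sketch of the pixelwise normalization above, assuming NCHW feature maps; the mean over the channel dimension gives the 1/N factor:

```python
import torch

def pixel_norm(a, eps=1e-8):
    # a: (batch, N, H, W); normalize each pixel's feature vector across all N channels
    return a / torch.sqrt(a.pow(2).mean(dim=1, keepdim=True) + eps)

x = torch.randn(4, 512, 4, 4)
print(pixel_norm(x).shape)  # torch.Size([4, 512, 4, 4])
```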

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (ICML 2015)
normalizes x using the mean and variance computed over all values in the same mini-batch (see the sketch after this list)
learnable parameters: scale and bias \(\gamma, \beta\)
for prediction, use the population mean and variance estimated during training
The effectiveness diminishes when the training minibatches are small.
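A training-mode sketch assuming NCHW input with per-channel statistics; a real layer (e.g. torch.nn.BatchNorm2d) additionally tracks running statistics for use at prediction time:

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, C, H, W); gamma, beta: (C,)
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                 # statistics over batch, H, W
    var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 3, 16, 16)
y = batch_norm_train(x, gamma=torch.ones(3), beta=torch.zeros(3))
print(y.mean().item(), y.std().item())  # roughly 0 and 1
```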

Why does BatchNorm work?

The original paper attributes it to reducing internal covariate shift: the change in the distribution of network activations due to the change in network parameters during training.
How Does Batch Normalization Help Optimization? (NIPS 2018) demonstrates that the benefit is NOT due to internal covariate shift and suggests that BatchNorm makes the optimization landscape significantly smoother.
High Frequency Component Helps Explain the Generalization of Convolutional Neural Networks (CVPR 2020) suggests that BatchNorm makes the model pick up more high-frequency components, which reduces robustness. This is probably why Batch Normalization is a Cause of Adversarial Vulnerability (ICML 2019).

Layer Normalization

Layer Normalization (NIPS 2016)
no batch size required
for weight-sharing models, e.g. RNN and CNN (though it does not work well for CNNs)
may reduce the representational capacity of the model
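A minimal sketch assuming a (batch, features) input: statistics are computed per sample over the feature dimension, so no batch statistics are needed (torch.nn.LayerNorm is the built-in):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); each sample is normalized independently of the batch
    mean = x.mean(dim=-1, keepdim=True)
    var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

x = torch.randn(4, 128)
print(layer_norm(x, torch.ones(128), torch.zeros(128)).shape)  # torch.Size([4, 128])
```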

Weight Normalization

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks (NIPS 2016)
adv:

  1. does not rely on the batch
  2. less noise
  3. less computation, since the batch mean and variance do not need to be stored

disadv:

  1. needs good initialization; strongly dependent on the input data
  2. can be unstable during training
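A minimal sketch of the reparameterization w = g · v / ||v|| for a linear weight matrix, with illustrative shapes (torch.nn.utils.weight_norm wraps existing modules in the same way):

```python
import torch

def weight_norm(v, g):
    # v: direction, (out_features, in_features); g: length, (out_features,)
    return g.unsqueeze(1) * v / v.norm(dim=1, keepdim=True)

v = torch.randn(64, 32, requires_grad=True)   # learned direction
g = torch.ones(64, requires_grad=True)        # learned length
w = weight_norm(v, g)
print(w.norm(dim=1)[:3])  # each row has norm g (here 1.0)
```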

Instance Normalization

Instance Normalization: The Missing Ingredient for Fast Stylization (2016)
like BatchNorm with batch size 1 (statistics are still computed over height × width)
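A minimal sketch: statistics per sample and per channel, over H × W only (torch.nn.InstanceNorm2d is the built-in):

```python
import torch

def instance_norm(x, eps=1e-5):
    # x: (batch, C, H, W); each (sample, channel) plane is normalized on its own
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = ((x - mean) ** 2).mean(dim=(2, 3), keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 3, 32, 32)
print(instance_norm(x).shape)  # torch.Size([2, 3, 32, 32])
```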

SELU (NIPS 2017)

Activation functions/SELU

Group Normalization

Group Normalization (ECCV 2018)
../_images/group_norm_comparison.png
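A minimal sketch, assuming the channel count divides evenly into groups: each group of channels is normalized per sample (torch.nn.GroupNorm is the built-in):

```python
import torch

def group_norm(x, num_groups, eps=1e-5):
    # x: (batch, C, H, W); split channels into groups, normalize each group per sample
    b, c, h, w = x.shape
    x = x.view(b, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = ((x - mean) ** 2).mean(dim=(2, 3, 4), keepdim=True)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(b, c, h, w)

x = torch.randn(2, 32, 16, 16)
print(group_norm(x, num_groups=8).shape)  # torch.Size([2, 32, 16, 16])
```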

Conditional BatchNorm

Modulating early visual processing by language (NIPS 2017)
pytorch | guessWhat?!
\(\gamma, \beta\) are produced by a one-hidden-layer MLP conditioned on the input (see the sketch below)
used in cGANs with projection discriminator and in SAGAN
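A sketch of the idea with illustrative names and sizes: a non-affine BatchNorm followed by modulation predicted from a conditioning vector by a one-hidden-layer MLP (here the MLP predicts offsets around \(\gamma=1, \beta=0\); this is a simplification, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_features, cond_dim, hidden_dim=64):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)   # normalization without affine params
        self.mlp = nn.Sequential(                               # one-hidden-layer MLP -> [dgamma, dbeta]
            nn.Linear(cond_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * num_features))

    def forward(self, x, cond):
        dgamma, dbeta = self.mlp(cond).chunk(2, dim=1)          # offsets predicted from the condition
        gamma = (1 + dgamma).unsqueeze(-1).unsqueeze(-1)        # modulate around gamma=1
        beta = dbeta.unsqueeze(-1).unsqueeze(-1)                # and beta=0
        return gamma * self.bn(x) + beta

cbn = ConditionalBatchNorm2d(num_features=16, cond_dim=10)
print(cbn(torch.randn(4, 16, 8, 8), torch.randn(4, 10)).shape)  # torch.Size([4, 16, 8, 8])
```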

AdaIN

Stands for Adaptive Instance Normalization
Arbitrary style transfer in real-time with adaptive instance normalization (ICCV 2017)
scales the normalized content features to match the output (style) domain distribution

\[AdaIN(x,y)=\sigma_y\big(\frac{x-\mu_x }{\sigma_x}\big)+\mu_y\]
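A minimal sketch of the formula above: align the per-channel mean and standard deviation of the content features x to those of the style features y:

```python
import torch

def adain(x, y, eps=1e-5):
    # x: content features, y: style features, both (batch, C, H, W)
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_x = torch.sqrt(((x - mu_x) ** 2).mean(dim=(2, 3), keepdim=True) + eps)
    sigma_y = torch.sqrt(((y - mu_y) ** 2).mean(dim=(2, 3), keepdim=True) + eps)
    return sigma_y * (x - mu_x) / sigma_x + mu_y

content, style = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(adain(content, style).shape)  # torch.Size([1, 64, 32, 32])
```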

Batch Renormalization

Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models (NIPS 2017) by Google
Problem: a small batch cannot cover the real data distribution, so BatchNorm performs badly; if the moving-average mean and SD are used during training instead, gradient optimization and the normalization counteract each other and the model blows up.
Batch Renormalization applies a novel re-normalization to x (before scaling), with:
mean and SD of the batch: \(\mu_B, \sigma_B\)
moving average (over more than one batch) of the mean and SD: \(\mu, \sigma\)

\[\begin{split}
\mu_B & \leftarrow \frac1{m}\sum^m_{i=1}x_i \\
\sigma_B & \leftarrow \sqrt{\epsilon + \frac1{m}\sum^m_{i=1}(x_i-\mu_B)^2}\\
r & \leftarrow \text{stop\_gradient}\Big(\text{clip}_{[1/r_{max},\, r_{max}]}\Big(\frac{\sigma_B}{\sigma}\Big)\Big) \\
d & \leftarrow \text{stop\_gradient}\Big(\text{clip}_{[-d_{max},\, d_{max}]}\Big(\frac{\mu_B-\mu}{\sigma}\Big)\Big) \\
\hat{x}_i & \leftarrow \frac{x_i-\mu_B}{\sigma_B}\cdot r + d \\
y_i & \leftarrow \gamma \hat{x}_i + \beta
\end{split}\]
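A single-training-step sketch of the update above, assuming (m, features) input; r and d are treated as constants via detach(), and the running \(\mu, \sigma\) are maintained outside the function:

```python
import torch

def batch_renorm_step(x, mu, sigma, gamma, beta, r_max=3.0, d_max=5.0, eps=1e-5):
    # x: (m, features); mu, sigma: running mean/SD kept and updated outside this function
    mu_b = x.mean(dim=0)
    sigma_b = torch.sqrt(eps + ((x - mu_b) ** 2).mean(dim=0))
    r = (sigma_b / sigma).clamp(1.0 / r_max, r_max).detach()   # stop_gradient + clip
    d = ((mu_b - mu) / sigma).clamp(-d_max, d_max).detach()
    x_hat = (x - mu_b) / sigma_b * r + d
    return gamma * x_hat + beta

x = torch.randn(16, 8)
y = batch_renorm_step(x, mu=torch.zeros(8), sigma=torch.ones(8),
                      gamma=torch.ones(8), beta=torch.zeros(8))
print(y.shape)  # torch.Size([16, 8])
```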

SPADE

stands for SPatially-Adaptive (DE)normalization
Semantic Image Synthesis with Spatially-Adaptive Normalization (CVPR 2019) by Nvidia
https://nvlabs.github.io/SPADE/images/teaser_high_res_uncompressed.png
Problem addressed: semantic information is washed away through stacks of convolution, normalization, and nonlinearity layers.
As in BatchNorm, the activation is normalized channel-wise and then modulated with learned \(\gamma, \beta\).
Unlike prior conditional normalization methods, \(\gamma, \beta\) are not vectors but tensors with spatial dimensions.
https://nvlabs.github.io/SPADE/images/method.png
The input layout (semantic segmentation map) modulates the activations in the normalization layers through a spatially-adaptive, learned transformation (a two-layer convolutional network); see the sketch below.
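A sketch of the block described above, with illustrative layer sizes (not the paper's exact configuration): parameter-free normalization, then spatially varying \(\gamma, \beta\) produced by a two-layer conv net on the resized segmentation map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, num_features, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)          # parameter-free normalization
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_features, 3, padding=1)      # spatial gamma map
        self.beta = nn.Conv2d(hidden, num_features, 3, padding=1)       # spatial beta map

    def forward(self, x, segmap):
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

spade = SPADE(num_features=64, label_channels=10)
y = spade(torch.randn(2, 64, 32, 32), torch.randn(2, 10, 256, 256))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```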

Summary and use cases

| Method | Usable situation | Unsuitable application | Example model(s) |
| --- | --- | --- | --- |
| LRN | | | AlexNet, ProgressiveGAN (variant with n = N, all channels) |
| BatchNorm | batch reflects the real data distribution | RNN, small batch size, patch-based, WGAN-GP | UNet |
| LayerNorm | no batch size required; RNN | | |
| WeightNorm | no batch size required | | WDSR |
| InstanceNorm | instance-specific tasks, e.g. img2img | classification | DeblurGAN |
| SELU | feed-forward networks (FNNs) | CNN, RNN | audio / feature-based models |
| GroupNorm | | | |
| BatchReNorm | any BatchNorm use with small batch size | | OCR |
| Conditional BatchNorm | conditional GAN | unconditional | SAGAN |
| AdaIN | style transfer | | StyleGAN, U-GAT-IT, FUNIT |
| SPADE | conditional CNN / GAN | unconditional | GauGAN |