Normalization

Note: normalization is usually applied before the activation function.

Local Response Normalization

LRN from AlexNet

\[b^i_{x,y}=\frac{a^i_{x,y}}{\Big(k+\frac{\alpha}{n}\sum^{\min(N-1,\ i+n/2)}_{c=\max(0,\ i-n/2)} (a^c_{x,y})^2 \Big)^\beta}\]

The sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The ordering of the kernel maps is arbitrary and fixed before training begins. This response normalization implements a form of lateral inhibition, inspired by real neurons, creating competition for large activations among neuron outputs computed with different kernels.
Later work claims that LRN is not actually useful.
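A minimal sketch of the formula above, assuming NCHW tensors and illustrative hyperparameters (torch.nn.LocalResponseNorm provides a built-in equivalent):

```python
import torch

def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    # a: (batch, N_channels, H, W); sum squared activations over a window of n adjacent channels
    N = a.size(1)
    sq = a.pow(2)
    out = torch.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)   # inclusive channel window around i
        denom = (k + (alpha / n) * sq[:, lo:hi + 1].sum(dim=1)).pow(beta)
        out[:, i] = a[:, i] / denom
    return out

x = torch.randn(2, 16, 8, 8)
print(lrn(x).shape)  # torch.Size([2, 16, 8, 8])
```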

Pixelwise Feature Vector Normalization (ICLR 2018)

From Progressive GAN; a variant of LRN applied over all channels, i.e. n = N.

\[b_{x,y}=\frac{a_{x,y}}{\sqrt{\frac1{N}\sum^{N-1}_{j=0}(a^j_{x,y})^2+\epsilon}}\]
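A minimal sketch of the pixelwise normalization above, assuming NCHW feature maps; the mean over the channel dimension gives the 1/N factor:

```python
import torch

def pixel_norm(a, eps=1e-8):
    # a: (batch, N, H, W); normalize each pixel's feature vector across all N channels
    return a / torch.sqrt(a.pow(2).mean(dim=1, keepdim=True) + eps)

x = torch.randn(4, 512, 4, 4)
print(pixel_norm(x).shape)  # torch.Size([4, 512, 4, 4])
```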

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (ICML 2015)
normalizes x using the mean and variance computed over all values in the same mini-batch (see the sketch after this list)
learnable parameters: scale and bias \(\gamma, \beta\)
for prediction, use the population mean and variance estimated during training
The effectiveness diminishes when the training minibatches are small.
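A training-mode sketch assuming NCHW input with per-channel statistics; a real layer (e.g. torch.nn.BatchNorm2d) additionally tracks running statistics for use at prediction time:

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, C, H, W); gamma, beta: (C,)
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                 # statistics over batch, H, W
    var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 3, 16, 16)
y = batch_norm_train(x, gamma=torch.ones(3), beta=torch.zeros(3))
print(y.mean().item(), y.std().item())  # roughly 0 and 1
```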

Why does BatchNorm work?

The original paper attributes it to reducing internal covariate shift: the change in the distribution of network activations due to the change in network parameters during training.
How Does Batch Normalization Help Optimization? (NIPS 2018) demonstrates that the benefit is NOT due to internal covariate shift and suggests that BatchNorm makes the optimization landscape significantly smoother.
High Frequency Component Helps Explain the Generalization of Convolutional Neural Networks (CVPR 2020) suggests that BatchNorm makes the model pick up more high-frequency components, which reduces robustness. This is probably why Batch Normalization is a Cause of Adversarial Vulnerability (ICML 2019).

Layer Normalization

Layer Normalization (NIPS 2016)
no batch size required
for weight-sharing models, e.g. RNN and CNN (though it does not work well for CNNs)
may reduce the representational capacity of the model
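A minimal sketch assuming a (batch, features) input: statistics are computed per sample over the feature dimension, so no batch statistics are needed (torch.nn.LayerNorm is the built-in):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); each sample is normalized independently of the batch
    mean = x.mean(dim=-1, keepdim=True)
    var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

x = torch.randn(4, 128)
print(layer_norm(x, torch.ones(128), torch.zeros(128)).shape)  # torch.Size([4, 128])
```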

Weight Normalization

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks (NIPS 2016)
adv:

  1. does not rely on the batch
  2. less noise
  3. less computation, since the batch mean and variance do not need to be stored

disadv:

  1. needs good initialization; strongly dependent on the input data
  2. can be unstable during training
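A minimal sketch of the reparameterization w = g · v / ||v|| for a linear weight matrix, with illustrative shapes (torch.nn.utils.weight_norm wraps existing modules in the same way):

```python
import torch

def weight_norm(v, g):
    # v: direction, (out_features, in_features); g: length, (out_features,)
    return g.unsqueeze(1) * v / v.norm(dim=1, keepdim=True)

v = torch.randn(64, 32, requires_grad=True)   # learned direction
g = torch.ones(64, requires_grad=True)        # learned length
w = weight_norm(v, g)
print(w.norm(dim=1)[:3])  # each row has norm g (here 1.0)
```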

Instance Normalization

Instance Normalization: The Missing Ingredient for Fast Stylization (2016)
like BatchNorm with batch size 1 (statistics are still computed over height × width)
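A minimal sketch: statistics per sample and per channel, over H × W only (torch.nn.InstanceNorm2d is the built-in):

```python
import torch

def instance_norm(x, eps=1e-5):
    # x: (batch, C, H, W); each (sample, channel) plane is normalized on its own
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = ((x - mean) ** 2).mean(dim=(2, 3), keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 3, 32, 32)
print(instance_norm(x).shape)  # torch.Size([2, 3, 32, 32])
```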

SELU (NIPS 2017)

Activation functions/SELU

Group Normalization

Group Normalization (ECCV 2018)
../_images/group_norm_comparison.png
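A minimal sketch, assuming the channel count divides evenly into groups: each group of channels is normalized per sample (torch.nn.GroupNorm is the built-in):

```python
import torch

def group_norm(x, num_groups, eps=1e-5):
    # x: (batch, C, H, W); split channels into groups, normalize each group per sample
    b, c, h, w = x.shape
    x = x.view(b, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = ((x - mean) ** 2).mean(dim=(2, 3, 4), keepdim=True)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(b, c, h, w)

x = torch.randn(2, 32, 16, 16)
print(group_norm(x, num_groups=8).shape)  # torch.Size([2, 32, 16, 16])
```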

Conditional BatchNorm

Modulating early visual processing by language (NIPS 2017)
pytorch | guessWhat?!
\(\gamma, \beta\) are produced by a one-hidden-layer MLP conditioned on the input (see the sketch below)
used in cGANs with projection discriminator and in SAGAN
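A sketch of the idea with illustrative names and sizes: a non-affine BatchNorm followed by modulation predicted from a conditioning vector by a one-hidden-layer MLP (here the MLP predicts offsets around \(\gamma=1, \beta=0\); this is a simplification, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_features, cond_dim, hidden_dim=64):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)   # normalization without affine params
        self.mlp = nn.Sequential(                               # one-hidden-layer MLP -> [dgamma, dbeta]
            nn.Linear(cond_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * num_features))

    def forward(self, x, cond):
        dgamma, dbeta = self.mlp(cond).chunk(2, dim=1)          # offsets predicted from the condition
        gamma = (1 + dgamma).unsqueeze(-1).unsqueeze(-1)        # modulate around gamma=1
        beta = dbeta.unsqueeze(-1).unsqueeze(-1)                # and beta=0
        return gamma * self.bn(x) + beta

cbn = ConditionalBatchNorm2d(num_features=16, cond_dim=10)
print(cbn(torch.randn(4, 16, 8, 8), torch.randn(4, 10)).shape)  # torch.Size([4, 16, 8, 8])
```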

AdaIN

Stands for Adaptive Instance Normalization
Arbitrary style transfer in real-time with adaptive instance normalization (ICCV 2017)
scales the normalized content features to match the output (style) domain distribution

\[AdaIN(x,y)=\sigma_y\big(\frac{x-\mu_x }{\sigma_x}\big)+\mu_y\]
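A minimal sketch of the formula above: align the per-channel mean and standard deviation of the content features x to those of the style features y:

```python
import torch

def adain(x, y, eps=1e-5):
    # x: content features, y: style features, both (batch, C, H, W)
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_x = torch.sqrt(((x - mu_x) ** 2).mean(dim=(2, 3), keepdim=True) + eps)
    sigma_y = torch.sqrt(((y - mu_y) ** 2).mean(dim=(2, 3), keepdim=True) + eps)
    return sigma_y * (x - mu_x) / sigma_x + mu_y

content, style = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(adain(content, style).shape)  # torch.Size([1, 64, 32, 32])
```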

Batch Renormalization

Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models (NIPS 2017) by Google
Problem: a small batch cannot cover the real data distribution, so BatchNorm performs badly; if the moving-average mean and SD are used during training instead, gradient optimization and the normalization counteract each other and the model blows up.
Batch Renormalization applies a novel re-normalization to x (before scaling), with:
mean and SD of the batch: \(\mu_B, \sigma_B\)
moving average (over more than one batch) of the mean and SD: \(\mu, \sigma\)

\[\begin{split}
\mu_B & \leftarrow \frac1{m}\sum^m_{i=1}x_i \\
\sigma_B & \leftarrow \sqrt{\epsilon + \frac1{m}\sum^m_{i=1}(x_i-\mu_B)^2}\\
r & \leftarrow \text{stop\_gradient}\Big(\text{clip}_{[1/r_{max},\, r_{max}]}\Big(\frac{\sigma_B}{\sigma}\Big)\Big) \\
d & \leftarrow \text{stop\_gradient}\Big(\text{clip}_{[-d_{max},\, d_{max}]}\Big(\frac{\mu_B-\mu}{\sigma}\Big)\Big) \\
\hat{x}_i & \leftarrow \frac{x_i-\mu_B}{\sigma_B}\cdot r + d \\
y_i & \leftarrow \gamma \hat{x}_i + \beta
\end{split}\]
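A single-training-step sketch of the update above, assuming (m, features) input; r and d are treated as constants via detach(), and the running \(\mu, \sigma\) are maintained outside the function:

```python
import torch

def batch_renorm_step(x, mu, sigma, gamma, beta, r_max=3.0, d_max=5.0, eps=1e-5):
    # x: (m, features); mu, sigma: running mean/SD kept and updated outside this function
    mu_b = x.mean(dim=0)
    sigma_b = torch.sqrt(eps + ((x - mu_b) ** 2).mean(dim=0))
    r = (sigma_b / sigma).clamp(1.0 / r_max, r_max).detach()   # stop_gradient + clip
    d = ((mu_b - mu) / sigma).clamp(-d_max, d_max).detach()
    x_hat = (x - mu_b) / sigma_b * r + d
    return gamma * x_hat + beta

x = torch.randn(16, 8)
y = batch_renorm_step(x, mu=torch.zeros(8), sigma=torch.ones(8),
                      gamma=torch.ones(8), beta=torch.zeros(8))
print(y.shape)  # torch.Size([16, 8])
```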

SPADE

stands for SPatially-Adaptive (DE)normalization
Semantic Image Synthesis with Spatially-Adaptive Normalization (CVPR 2019) by Nvidia
https://nvlabs.github.io/SPADE/images/teaser_high_res_uncompressed.png
Problem addressed: semantic information is washed away through stacks of convolution, normalization, and nonlinearity layers.
As in BatchNorm, the activation is normalized channel-wise and then modulated with learned \(\gamma, \beta\).
Unlike prior conditional normalization methods, \(\gamma, \beta\) are not vectors but tensors with spatial dimensions.
https://nvlabs.github.io/SPADE/images/method.png
The input layout (semantic segmentation map) modulates the activations in the normalization layers through a spatially-adaptive, learned transformation (a two-layer convolutional network); see the sketch below.
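A sketch of the block described above, with illustrative layer sizes (not the paper's exact configuration): parameter-free normalization, then spatially varying \(\gamma, \beta\) produced by a two-layer conv net on the resized segmentation map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, num_features, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)          # parameter-free normalization
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_features, 3, padding=1)      # spatial gamma map
        self.beta = nn.Conv2d(hidden, num_features, 3, padding=1)       # spatial beta map

    def forward(self, x, segmap):
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

spade = SPADE(num_features=64, label_channels=10)
y = spade(torch.randn(2, 64, 32, 32), torch.randn(2, 10, 256, 256))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```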

Summary and use cases

| Method | Usable situation | Unsuitable application | Example model(s) |
| --- | --- | --- | --- |
| LRN | | | AlexNet, ProgressiveGAN (variant with n = N, all channels) |
| BatchNorm | batch reflects the real data distribution | RNN, small batch size, patch-based, WGAN-GP | UNet |
| LayerNorm | no batch size required; RNN | | |
| WeightNorm | no batch size required | | WDSR |
| InstanceNorm | instance-specific tasks, e.g. img2img | classification | DeblurGAN |
| SELU | feed-forward networks (FNNs) | CNN, RNN | audio / feature-based models |
| GroupNorm | | | |
| BatchReNorm | any BatchNorm use with small batch size | | OCR |
| Conditional BatchNorm | conditional GAN | unconditional | SAGAN |
| AdaIN | style transfer | | StyleGAN, U-GAT-IT, FUNIT |
| SPADE | conditional CNN / GAN | unconditional | GauGAN |