Activation Functions

Introduce non-linearity into the output of a neuron

Step

\[\begin{split} step(x)= \begin{cases} 0 & \text{if } x< 0\\ 0.5 & \text{if } x= 0\\ 1 & \text{if } x> 0 \end{cases}\end{split}\]

could be used for a hashing layer
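A minimal NumPy sketch of the step function as defined above (the function name and example values are my own):

```python
import numpy as np

def step(x):
    """Step function: 0 for x < 0, 0.5 at x == 0, 1 for x > 0."""
    return np.where(x < 0, 0.0, np.where(x == 0, 0.5, 1.0))

print(step(np.array([-2.0, 0.0, 3.0])))  # [0.  0.5 1. ]
```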

Signum

\[\begin{split} sign(x)= \begin{cases} -1 & \text{if } x< 0\\ 0 & \text{if } x= 0\\ 1 & \text{if } x> 0 \end{cases}\end{split}\]

could be used for a hashing layer
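A minimal NumPy sketch; `np.sign` already matches this definition (-1, 0, 1):

```python
import numpy as np

def sign(x):
    """Signum: -1 for x < 0, 0 at x == 0, 1 for x > 0 (same as np.sign)."""
    return np.sign(x)

print(sign(np.array([-2.0, 0.0, 3.0])))  # [-1.  0.  1.]
```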

Sigmoid

a.k.a. the logistic function
saturating, monotonic
derivative is non-monotonic

\[f(x)=\sigma(x)=\dfrac{1}{1+e^{-x}}\]
\[f'(x)=f(x)(1-f(x))\]
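A minimal NumPy sketch of sigmoid and its derivative, expressing the derivative through the output as in the formula above (names and example values are my own):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative expressed through the output: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))       # [0.119 0.5   0.881] (approx.)
print(sigmoid_grad(x))  # [0.105 0.25  0.105] (approx.)
```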

Tanh

saturating, monotonic
derivative is non-monotonic

\[f(x)=\tanh(x)=\dfrac{e^x-e^{-x}}{e^x+e^{-x}}\]
\[f'(x)=1-f(x)^2\]
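A minimal NumPy sketch of tanh and its derivative \(1-f(x)^2\) (names and example values are my own):

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent (same as np.tanh)."""
    return np.tanh(x)

def tanh_grad(x):
    """Derivative expressed through the output: 1 - f(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))       # [-0.964  0.     0.964] (approx.)
print(tanh_grad(x))  # [ 0.071  1.     0.071] (approx.)
```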

ReLU

Rectified Linear Unit
non-saturating, monotonic, derivative monotonic
fast to compute and mitigates the vanishing gradient problem, but may contribute to exploding gradients
can be viewed as a form of dropout when the input is < 0 (the unit outputs 0)

\[\begin{split}f(x) = ReLU(x) = max(0,x) = \begin{cases} 0 , & \text{if } x\leq 0\\ x , & \text{if } x> 0 \end{cases}\end{split}\]
\[\begin{split}f'(x) = \begin{cases} 0 , & \text{if } x\leq 0\\ 1 , & \text{if } x> 0 \end{cases} \end{split}\]
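A minimal NumPy sketch of ReLU and its derivative, choosing 0 at \(x=0\) as in the case split above:

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative: 0 for x <= 0, 1 for x > 0."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```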

Leaky ReLU

\[\begin{split}f(x) = \begin{cases} 0.01x , & \text{if } x\leq 0\\ x , & \text{if } x> 0 \end{cases}\end{split}\]
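A minimal NumPy sketch with the 0.01 slope from the definition above (the `slope` argument is my own parameterization):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for x > 0, slope * x otherwise."""
    return np.where(x > 0, x, slope * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.02  0.    3.  ]
```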

PReLU - Parametric ReLU

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

\[\begin{split}f(x) = \begin{cases} \alpha x , & \text{if } x\leq 0\\ x , & \text{if } x> 0 \end{cases}\end{split}\]

The momentum method is adopted when updating \(\alpha_i\)

\[\Delta\alpha_i := \mu\Delta\alpha_i+\epsilon\dfrac{\partial\varepsilon}{\partial\alpha_i}\]

where \(\mu\) is the momentum and \(\epsilon\) is the learning rate
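A minimal NumPy sketch of the PReLU forward pass and the momentum update for \(\alpha\), assuming a single shared \(\alpha\); the upstream gradient, momentum, and learning-rate values are hypothetical placeholders:

```python
import numpy as np

def prelu(x, alpha):
    """PReLU forward pass: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def dloss_dalpha(x, dloss_dy):
    """Gradient of the loss w.r.t. alpha: sum of dL/dy * x over positions where x <= 0."""
    return np.sum(dloss_dy * np.where(x > 0, 0.0, x))

alpha, delta_alpha = 0.25, 0.0
mu, lr = 0.9, 0.01                      # momentum and learning rate (assumed values)
x = np.array([-1.0, 2.0, -3.0])         # hypothetical pre-activations
dloss_dy = np.array([0.1, -0.2, 0.3])   # hypothetical upstream gradient

# Momentum update, sign convention following the formula as written above.
delta_alpha = mu * delta_alpha + lr * dloss_dalpha(x, dloss_dy)
alpha += delta_alpha
print(alpha)
```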

ELU - Exponential Linear Units (2015)

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

\[\begin{split}f(\alpha,x)= \begin{cases} \alpha (e^x-1) , & \text{if } x\leq 0\\ x , & \text{if } x> 0 \end{cases}\end{split}\]
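A minimal NumPy sketch of ELU (the default \(\alpha=1\) is an assumption, not taken from the formula above):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-2.0, 0.0, 3.0])))  # [-0.865  0.     3.   ] (approx.)
```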

SELU - Scaled Exponential Linear Unit (NIPS 2017)

Self-Normalizing Neural Networks
usually used to replace a normalization layer

\[\begin{split} selu(x)= \lambda \begin{cases} x , & \text{if } x> 0\\ \alpha e^x-\alpha, & \text{if } x\leq 0 \end{cases}\end{split}\]

α and λ are derived from the inputs. For standard scaled inputs (mean 0, stddev 1), the values are α ≈ 1.6732 and λ ≈ 1.0507. Alexia Jolicoeur-Martineau reported that replacing BatchNorm and ReLUs with SELUs helps train high-resolution DCGANs.
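A minimal NumPy sketch of SELU using the constants from the Self-Normalizing Neural Networks paper:

```python
import numpy as np

# Constants for zero-mean, unit-variance inputs (from the SELU paper).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-2.0, 0.0, 3.0])))  # [-1.52   0.     3.152] (approx.)
```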

Softmax

maps the outputs of multiple neurons to [0, 1] (the outputs sum to 1)
usually used in the last layer to output a probability for each label in classification

\[S_i = \frac{e^{a_i}}{\sum^N_{k=1}e^{a_k}}\]
\[\frac{\partial S_i}{\partial a_j} = \frac{\partial}{\partial a_j}\frac{e^{a_i}}{\sum^N_{k=1}e^{a_k}}\]
\[\text{quotient rule of derivative. For }f(x)=\frac{g(x)}{h(x)}, f'(x)=\frac{g'(x)h(x)-h'(x)g(x)}{[h(x)]^2}\]
\[\begin{split}\text{for } i=j:\quad \frac{\partial}{\partial a_j}\frac{e^{a_i}}{\sum^N_{k=1}e^{a_k}} &=\frac{e^{a_i}\Sigma-e^{a_j}e^{a_i}}{\Sigma^2} =\frac{e^{a_i}}{\Sigma}\cdot\frac{\Sigma-e^{a_j}}{\Sigma} =S_i(1-S_j)\\ \text{for } i\neq j:\quad \frac{\partial}{\partial a_j}\frac{e^{a_i}}{\sum^N_{k=1}e^{a_k}} &=\frac{0-e^{a_j}e^{a_i}}{\Sigma^2} =-\frac{e^{a_i}}{\Sigma}\cdot\frac{e^{a_j}}{\Sigma} =-S_i S_j\end{split}\]
\[\text{where } \Sigma=\sum^N_{k=1}e^{a_k}\]
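A minimal NumPy sketch of softmax and its Jacobian \(\frac{\partial S_i}{\partial a_j}=S_i(\delta_{ij}-S_j)\), matching the derivation above; the max-shift for numerical stability is an added detail not in the formulas:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: shift by max(a) before exponentiating."""
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def softmax_jacobian(a):
    """Jacobian dS_i/da_j: S_i(1 - S_i) on the diagonal, -S_i S_j off the diagonal."""
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)

a = np.array([1.0, 2.0, 3.0])
print(softmax(a))           # [0.09  0.245 0.665] (approx.), sums to 1
print(softmax_jacobian(a))  # 3x3 Jacobian matrix
```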