- Computationally-efficient
- Discontinuous at \(x=0\)
- Dead neurons due to poor initialization or a high learning rate; initialize with a slight +ve bias
- \(\exp\) is computationally-expensive, though not significant in large networks
Maxout
\(\max(w_1 x + b_1, w_2 x + b_2)\)
- Generalization of ReLU and Leaky ReLU
- Doubles the number of parameters
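A minimal scalar sketch of a Maxout unit (hypothetical `maxout` helper): with pieces \((w_1, b_1) = (1, 0)\) and \((w_2, b_2) = (0, 0)\) it reduces exactly to ReLU, illustrating the generalization above.

```python
def maxout(x, params):
    """Maxout unit: max over affine pieces (w_i * x + b_i).

    `params` is a list of (w, b) pairs — hypothetical scalar sketch,
    not a library API.
    """
    return max(w * x + b for w, b in params)

# With pieces (1, 0) and (0, 0), maxout reduces to ReLU: max(x, 0)
relu_like = [(1.0, 0.0), (0.0, 0.0)]
print(maxout(2.5, relu_like))   # 2.5
print(maxout(-1.0, relu_like))  # 0.0
```

Each extra piece adds its own \((w, b)\), which is where the parameter doubling (for two pieces) comes from.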
Generalized Logistic
\(a + (b-a) \dfrac{1}{1+e^{-k(x-x_0)}}\)
\(a=\) minimum, \(b=\) maximum, \(k=\) steepness, \(x_0=\) center of \(x\)
\(x_0 + \dfrac{1}{k} \ln \left \vert \dfrac{x-a}{b-x} \right \vert\)
Continuous
\([a, b]\)
Depends on \(a\) and \(b\)
- \(\exp\) is computationally-expensive, though not significant in large networks
- Easy to interpret: a "probabilistic", saturating "firing rate" of a neuron
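A small sketch (assumed helper names) evaluating the generalized logistic and its inverse from the formulas above, checking that they round-trip:

```python
import math

def gen_logistic(x, a, b, k, x0):
    """Generalized logistic: a + (b - a) / (1 + exp(-k (x - x0)))."""
    return a + (b - a) / (1.0 + math.exp(-k * (x - x0)))

def gen_logistic_inv(y, a, b, k, x0):
    """Inverse: x0 + (1/k) * ln((y - a) / (b - y)), valid for a < y < b."""
    return x0 + math.log((y - a) / (b - y)) / k

# Round trip: inverse(forward(x)) recovers x
y = gen_logistic(0.7, a=-1, b=1, k=2, x0=0.5)
print(round(gen_logistic_inv(y, a=-1, b=1, k=2, x0=0.5), 6))  # 0.7
```

Setting \(a=0\), \(b=1\), \(k=1\), \(x_0=0\) recovers the standard sigmoid below.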
Sigmoid/ Standard Logistic/ Soft Step
\(\dfrac{1}{1+e^{-x}}\)
\(\ln \left \vert \dfrac{x}{1-x} \right \vert\)
Binary-Continuous
\([0, 1]\)
Not zero-centered
- \(\exp\) is computationally-expensive, though not significant in large networks
- Easy to interpret: a "probabilistic", saturating "firing rate" of a neuron
Fast Softsign Sigmoid
\(0.5 \Bigg( 1+\dfrac{x}{1 + \vert x \vert} \Bigg)\)
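A quick sketch (assumed helper names) comparing the standard sigmoid with the fast softsign variant above: both map \(\mathbb{R}\) into \((0, 1)\) and agree at \(x=0\), but the latter avoids \(\exp\) entirely.

```python
import math

def sigmoid(x):
    """Standard logistic: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def fast_softsign_sigmoid(x):
    """0.5 * (1 + x / (1 + |x|)) — no exp, so cheaper per call."""
    return 0.5 * (1.0 + x / (1.0 + abs(x)))

# Both saturate toward 0 and 1 and cross 0.5 at x = 0
for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), round(fast_softsign_sigmoid(x), 3))
```

Note the softsign version saturates polynomially rather than exponentially, so it approaches its asymptotes more slowly.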
Softmax
\(\dfrac{e^{x_i}}{\sum_{j=1}^k e^{x_j}}\) where \(k=\) number of classes, such that \(\sum_{i=1}^k p_i = 1\)
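A numerically stable softmax sketch (assumed helper name): subtracting the max logit before exponentiating is the standard trick to avoid overflow in \(\exp\), and leaves the result unchanged.

```python
import math

def softmax(logits):
    """Softmax over a list of logits, with max-subtraction for stability."""
    m = max(logits)                              # shift so largest exponent is 0
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]             # probabilities summing to 1

p = softmax([2.0, 1.0, 0.1])
print([round(v, 3) for v in p])
print(round(sum(p), 6))  # 1.0
```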
Since a non-zero-centered activation function such as sigmoid always outputs +ve values, it constrains the gradients of all weights feeding into a neuron to share one sign: all +ve or all -ve.

This leads to sub-optimal (zig-zag) steps in the update procedure, and hence slower convergence.
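The same-sign claim can be checked numerically (sketch with assumed values): for \(z = w \cdot h + b\) with all-positive sigmoid activations \(h\), \(\partial L / \partial w_i = (\partial L / \partial z)\, h_i\), so every component of the weight gradient takes the sign of the upstream gradient \(\partial L / \partial z\).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hidden activations from a sigmoid layer are always positive
h = [sigmoid(v) for v in (-1.0, 0.3, 2.0)]

# For z = w·h + b, dL/dw_i = (dL/dz) * h_i: with h_i > 0, every
# component of the gradient shares the sign of dL/dz
for dL_dz in (+0.7, -0.7):
    grad_w = [dL_dz * hi for hi in h]
    print(all(g > 0 for g in grad_w) or all(g < 0 for g in grad_w))  # True
```

A zero-centered activation (e.g. tanh) produces mixed-sign \(h\), so gradient components can differ in sign and the update need not zig-zag.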