NN Initialization
Weight Initialization
All-Zero Initialization
It is tempting to initialize all the weights to zero, but this is a serious mistake: with all-zero initialization, every neuron in a layer computes the same output and therefore receives the same gradient during backpropagation, so the neurons stay identical after every update. We gain nothing from having many identical neurons. In fact, this symmetry problem exists whenever the weights are initialized to the same value.
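A minimal sketch of the symmetry problem (the layer sizes, the constant value 0.5, and the squared-error loss are made-up choices for illustration): when every hidden neuron starts with identical weights, the gradient columns come out identical too, so the hidden neurons remain copies of each other after every update.

```python
import numpy as np

np.random.seed(0)
D, H, N = 4, 3, 8                        # input dim, hidden units, batch size
x = np.random.randn(N, D)                # a random input batch
y = np.random.randn(N, 1)                # random regression targets

W1 = np.full((D, H), 0.5)                # every hidden neuron has the same weights
W2 = np.full((H, 1), 0.5)

h = np.maximum(0, x @ W1)                # all columns of h are identical
out = h @ W2
dout = 2 * (out - y) / N                 # gradient of the mean squared error
dW2 = h.T @ dout
dW1 = x.T @ ((dout @ W2.T) * (h > 0))    # backprop through the ReLU

print(dW1)                               # every column is the same, so the
                                         # hidden neurons never differentiate
```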
Small random values
A natural fix is to initialize the weights to small random values, e.g. \(W=0.01*np.random.randn(D,H)\). This breaks the symmetry, but it is still problematic: very small weights shrink the signal at every layer, and the gradients flowing backward shrink with it, so the updates become smaller and smaller. In a deep network this is serious, and you may find that the layers far from the output barely update at all.
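A quick sketch of this effect (the ten-layer tanh stack and the 500-unit widths are assumptions for illustration, not from the text above): with \(0.01\)-scaled weights, the standard deviation of the activations collapses toward zero layer by layer, and since the gradient on each weight matrix is proportional to that layer's input, the early layers receive almost no update.

```python
import numpy as np

np.random.seed(0)
h = np.random.randn(1000, 500)            # a batch of roughly unit-variance inputs
for layer in range(10):
    W = 0.01 * np.random.randn(500, 500)  # "small random values" initialization
    h = np.tanh(h @ W)
    print(f"layer {layer}: std of activations = {h.std():.2e}")
# the printed std shrinks by roughly 4-5x per layer and is ~1e-7 after 10 layers
```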
\(1/\sqrt{n}\) Normalization
To keep the signal stable during both the forward pass and backpropagation, we want the variance of each neuron's output to equal the variance of its input. Both \(W\) and \(x\) are assumed to be zero-mean (the weights by construction, the inputs after mean subtraction), so we only need to care about the variance. Let \(s\) be a neuron's output and \(x\) its input; our target is:
\[Var(s)=Var(x)\]Writing out \(s\) as the weighted sum of its inputs:
\[Var(s)=Var(\sum_i^n{w_ix_i})\]Because the terms \(w_ix_i\) are independent across \(i\), the variance of the sum is the sum of the variances:
\[Var(\sum_i^n{w_ix_i})=\sum_i^n{Var(w_ix_i)}\]and because \(w_i\) and \(x_i\) are independent and zero-mean, the variance of each product factors:
\[\sum_i^n{Var(w_ix_i)}=\sum_i^n{Var(x_i)Var(w_i)}\]Since all \(x_i\) are identically distributed, we write \(Var(x)\) for the variance of each \(x_i\); likewise we write \(Var(w)\) for the variance of each \(w_i\). Then:
\[Var(s)=\sum_i^n{Var(x_i)Var(w_i)}=\sum_i^n{Var(w)Var(x)}=(nVar(w))Var(x)\]So to make the variance of the output \(Var(s)\) equal to the variance of the input \(Var(x)\), we need:
\[nVar(w)=1\]that is, \(Var(w)=1/n\). By the scaling property of variance, \(Var(aX)=a^2Var(X)\) for a scalar \(a\), so dividing a unit-variance sample by \(\sqrt{n}\) gives exactly variance \(1/n\).
So if we initialize the weights as \(w=np.random.randn(n)/sqrt(n)\), we guarantee \(nVar(w)=1\).
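A quick numerical check of the result above (the dimensions and the number of trials are arbitrary): with \(w=np.random.randn(n)/sqrt(n)\), the variance of \(s=\sum_i{w_ix_i}\) matches the variance of \(x\).

```python
import numpy as np

np.random.seed(0)
n, trials = 500, 10000
x = np.random.randn(trials, n)                # inputs with Var(x) = 1
w = np.random.randn(trials, n) / np.sqrt(n)   # 1/sqrt(n)-scaled weights, redrawn per trial
s = (w * x).sum(axis=1)                       # s = sum_i w_i * x_i for each trial

print("Var(x):", x.var())                     # ~1.0
print("Var(s):", s.var())                     # ~1.0, so the signal scale is preserved
```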
For ReLU units, He et al. (2015) showed that the weight variance should be \(2/n\) instead, because ReLU zeroes out half of its input distribution; in that case we should use \(w=np.random.randn(n)*sqrt(2.0/n)\).
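A small sketch comparing the two scalings under ReLU (the depth and layer width are made-up values): with \(1/\sqrt{n}\) scaling the activation variance roughly halves at every layer, while the \(\sqrt{2/n}\) scaling keeps it steady.

```python
import numpy as np

def relu_stack_std(scale, depth=20, n=500, seed=0):
    """Push a batch through `depth` ReLU layers and return the final activation std."""
    rng = np.random.RandomState(seed)
    h = rng.randn(1000, n)
    for _ in range(depth):
        W = rng.randn(n, n) * scale(n)
        h = np.maximum(0, h @ W)              # ReLU
    return h.std()

print("1/sqrt(n):  final std =", relu_stack_std(lambda n: np.sqrt(1.0 / n)))
print("sqrt(2/n):  final std =", relu_stack_std(lambda n: np.sqrt(2.0 / n)))
```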
To be studied later: Batch Normalization
Batch normalization has been shown to be very helpful in practice; it still needs to be studied . . .