NN Initialization

Weight Initialization

All-Zero Initialization

It is tempting to simply set all the weights to zero, but this is a serious mistake: with all-zero initialization every neuron computes the same output and therefore receives the same gradient, so the neurons stay identical through every backpropagation update. A network full of identical neurons is no more expressive than a single neuron. In fact, this symmetry problem arises whenever the weights are initialized to the same value, not only to zero.
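Here is a minimal numpy sketch (the layer sizes and toy data are made up for illustration) of why this happens: with all-zero weights, every hidden unit computes the same output and receives exactly the same gradient, so updates can never make them different.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)          # 5 toy samples, 3 features (made-up data)
W = np.zeros((3, 4))               # all-zero init: 3 inputs -> 4 hidden units

h = np.tanh(X.dot(W))              # forward pass: every unit outputs the same value (0)
dout = np.ones_like(h)             # pretend upstream gradient from the loss
dW = X.T.dot(dout * (1 - h ** 2))  # backprop through tanh into the weights

# all 4 columns of dW are identical, so all 4 units get the same update and
# stay identical forever; the same happens for any constant initialization
print(dW)
```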

Small random values

A natural guess to break the symmetry of all-zero initialization is to use small random values, such as \(W=0.01*np.random.randn(D,H)\). This is also problematic: very small weights produce very small activations and gradients, and the gradient signal keeps shrinking as it propagates backwards. In a deep network this becomes serious, and you may find that the layers far from the output barely update at all.
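A small sketch of this effect (the depth and layer widths are chosen arbitrarily): with weights scaled by 0.01, the standard deviation of the activations collapses towards zero within a few tanh layers, and the gradients flowing backwards shrink in the same way.

```python
import numpy as np

np.random.seed(0)
n_layers, width = 10, 500
h = np.random.randn(1000, width)                 # made-up unit-variance input batch

for i in range(n_layers):
    W = 0.01 * np.random.randn(width, width)     # "small random values" init
    h = np.tanh(h.dot(W))
    print('layer %2d: std of activations = %.2e' % (i + 1, h.std()))

# the printed std shrinks towards zero layer after layer, so the gradients
# (which are proportional to these activations) also vanish deep in the network
```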

\(1/\sqrt{n}\) Normalization

So, in order to keep the signal stable as it flows through the network and during backpropagation, we want the variance of each neuron's output to be the same as the variance of its input. The means (expectations) of both \(W\) and \(x\) are initialized to be 0, so we only need to care about the variance. Let \(s\) be the output and \(x\) the input; our target is:

\[Var(s)=Var(x)\]

What we get now is:

\[Var(s)=Var(\sum_i^n{w_ix_i})\]

Since the terms \(w_ix_i\) are mutually independent, the variance of the sum equals the sum of the variances:

\[Var(\sum_i^n{w_ix_i})=\sum_i^n{Var(w_ix_i)}\]

Since \(w_i\) and \(x_i\) are independent and both have zero mean, \(Var(w_ix_i)=Var(w_i)Var(x_i)\), so:

\[\sum_i^n{Var(w_ix_i)}=\sum_i^n{Var(x_i)Var(w_i)}\]

Considering that all the \(x_i\) are identically distributed, we write \(Var(x)\) for the variance of each \(x_i\); likewise, we write \(Var(w)\) for the variance of each \(w_i\). Then we get:

\[Var(s)=\sum_i^n{Var(x_i)Var(w_i)}=\sum_i^n{Var(w)Var(x)}=\big(nVar(w)\big)Var(x)\]

It is now clear that in order to make the variance of the output \(Var(s)\) equal to the variance of the input \(Var(x)\), we need:

\[nVar(w)=1\]

Recall the scaling property of variance: for a random variable \(X\) and scalar \(a\), \(Var(aX)=a^2Var(X)\). Since np.random.randn draws samples with unit variance, dividing by \(\sqrt{n}\) gives each weight a variance of \(1/n\).

So if we initialize the weights as \(w=np.random.randn(n)/sqrt(n)\), we guarantee \(nVar(w)=1\).
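As a quick numerical sanity check (the sizes below are arbitrary), we can verify that this scaling keeps the output variance close to the input variance:

```python
import numpy as np

np.random.seed(0)
n = 500                                  # fan-in of the neuron (arbitrary)
x = np.random.randn(10000, n)            # unit-variance inputs
w = np.random.randn(n) / np.sqrt(n)      # the 1/sqrt(n) initialization

s = x.dot(w)                             # output of one linear neuron for each sample
print('Var(x) =', x.var())               # ~1.0
print('Var(s) =', s.var())               # also ~1.0, matching the derivation above
```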

According to more recent research (He et al., 2015), when using ReLU the weight variance should be \(2/n\) instead, because ReLU zeroes out half of its inputs; so we should use \(w=np.random.randn(n)*sqrt(2.0/n)\).
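A rough comparison of the two scalings in a deep ReLU stack (depth and widths picked arbitrarily): with variance \(2/n\) the activation scale stays roughly constant, while with \(1/n\) it shrinks at every layer.

```python
import numpy as np

np.random.seed(0)
depth, width = 10, 500
x = np.random.randn(2000, width)                    # made-up unit-variance input

for name, scale in [('1/sqrt(n)', np.sqrt(1.0 / width)),
                    ('sqrt(2/n)', np.sqrt(2.0 / width))]:
    h = x
    for _ in range(depth):
        W = np.random.randn(width, width) * scale   # scaled Gaussian weights
        h = np.maximum(0, h.dot(W))                 # ReLU
    print('%s init: std after %d ReLU layers = %.4f' % (name, depth, h.std()))

# with 1/sqrt(n) the scale shrinks by roughly sqrt(1/2) at every ReLU layer,
# while the 2/n variance keeps the activation scale roughly constant
```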

To be studied later: Batch Normalization

Batch normalization has been confirmed to be very helpful; it still needs to be studied . . .
