Comments
Own-Archer7158 t1_iy3h8pp wrote
If the learning rate is zero, the parameter update rule leaves the parameters unchanged
Data balancing does not change the loss (it only affects overfitting), and the same goes for a regularization strength that is too low
Bad initialization is rarely a problem (with bad luck you could start right at a local minimum, but that is a rare event)
canbooo t1_iy3pylo wrote
Bad initialization can be a problem if you do it yourself (i.e. bad scaling of weights) and if you are not using batch or other kinds of normalizations, since it might make your neurons die. E.g. a tanh neuron with too large input scale will only predict -1 or 1 for all data, which leads it to being dead, i.e. not learning anything due to 0 grad for the entire data set.
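To make the saturation argument concrete, here is a minimal NumPy sketch. The data, the 50-feature input size, and the two weight scales (0.1 and 10.0) are made up for illustration; it only looks at the local gradient of a single tanh neuron, 1 - tanh(z)^2, not at a full training run.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 50))  # hypothetical data: 1000 samples, 50 features

def tanh_neuron_stats(weight_scale):
    """Return (mean local gradient, fraction of saturated outputs) for one tanh neuron."""
    w = rng.normal(scale=weight_scale, size=50)  # weights drawn at the given scale
    z = x @ w                                    # pre-activation for every sample
    a = np.tanh(z)                               # neuron output
    local_grad = 1.0 - a**2                      # d tanh(z) / dz
    saturated = np.abs(a) > 0.99                 # output pinned near -1 or 1
    return float(local_grad.mean()), float(saturated.mean())

small_grad, small_sat = tanh_neuron_stats(0.1)
large_grad, large_sat = tanh_neuron_stats(10.0)

print(f"scale 0.1 : mean grad {small_grad:.4f}, saturated {small_sat:.1%}")
print(f"scale 10.0: mean grad {large_grad:.4f}, saturated {large_sat:.1%}")
```

With the large scale, almost every sample lands in the flat part of tanh, so the gradient flowing back through the neuron is close to zero across the data set.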
Own-Archer7158 t1_iy3q6b6 wrote
You are right, thank you
nutpeabutter t1_iy3kb5n wrote
>Bad initialization is rarely a problem
What if all weights are the same?
Own-Archer7158 t1_iy3m6j0 wrote
If all weights are the same (assume 0 for simplicity), then the output of the function/neural network is far from the objective/label
The gradient is therefore non zero
And finally the parameters are updated: theta = theta - learning_rate*grad_theta(loss)
And when the parameters are updated the loss is changed
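A small sketch of that update step, written as plain gradient descent (the gradient is subtracted) on a linear model with a mean-squared-error loss. The data, the learning rates, and the step count are assumptions chosen for illustration. It also covers the zero learning rate case from earlier in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # hypothetical inputs
y = X @ np.array([1.0, -2.0, 0.5])        # hypothetical targets

def loss_and_grad(theta):
    """Mean squared error of a linear model and its gradient w.r.t. theta."""
    residual = X @ theta - y
    loss = np.mean(residual**2)
    grad = 2.0 * X.T @ residual / len(y)
    return loss, grad

def train(learning_rate, steps=50):
    theta = np.zeros(3)                    # all weights start identical (zero)
    for _ in range(steps):
        _, grad = loss_and_grad(theta)
        theta = theta - learning_rate * grad  # gradient descent update
    return theta

initial_loss, initial_grad = loss_and_grad(np.zeros(3))
theta_frozen = train(learning_rate=0.0)    # zero learning rate: nothing moves
theta_trained = train(learning_rate=0.1)
final_loss, _ = loss_and_grad(theta_trained)

print("gradient norm at zero init:", np.linalg.norm(initial_grad))
print("params with lr=0:", theta_frozen)
print("loss before / after:", initial_loss, final_loss)
```

For this linear model the gradient at the all-zero start is non-zero, so the loss drops as soon as the learning rate is positive.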
Usually, the parameters are randomly chosen
nutpeabutter t1_iy3z9lc wrote
There is indeed a non-zero gradient. However, symmetric initialization introduces a plethora of problems:
- The only way to break the symmetry is through the random biases. A fully symmetric network effectively means that individual layers act as though they are a single weight (a 1-input, 1-output layer), which means that it cannot learn complex functions until the symmetry is broken. Learning will thus be highly delayed, as it has to first break the symmetry before being able to learn a useful function. This can explain the plateau at the start.
- Similar weights at the start, even if symmetry is broken, will lead to poor performance. It is easy to get trapped in local minima if your outputs are constrained due to your weights not having sufficient variance. There is a reason why weights are typically randomly initialized.
- Random weights also allow for more "learning pathways" to be established: by pure chance alone, a certain combination of weights will be slightly more correct than others. The network can then exploit this to speed up its learning, by changing its other weights to support these pathways. Symmetric weights do not possess such an advantage.
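The symmetry point can be checked with a tiny two-layer tanh network in NumPy. Everything here is an assumption made for illustration: the data, the target, the hidden width of 8, the constant init value 0.1, and the learning rate. Biases are also initialized identically (zero), so nothing breaks the symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))               # hypothetical inputs
y = np.sin(X[:, :1])                       # hypothetical non-linear target, shape (64, 1)
H = 8                                      # hidden width (arbitrary choice)

def forward(W1, b1, W2, b2):
    h = np.tanh(X @ W1 + b1)               # hidden activations, shape (64, H)
    return h, h @ W2 + b2                  # predictions, shape (64, 1)

def train(W1, b1, W2, b2, lr=0.05, steps=200):
    """Full-batch gradient descent on mean squared error."""
    for _ in range(steps):
        h, pred = forward(W1, b1, W2, b2)
        d_pred = 2.0 * (pred - y) / len(y)          # dLoss/dpred
        d_h = d_pred @ W2.T * (1.0 - h**2)          # backprop through tanh
        W1, b1 = W1 - lr * (X.T @ d_h), b1 - lr * d_h.sum(axis=0)
        W2, b2 = W2 - lr * (h.T @ d_pred), b2 - lr * d_pred.sum(axis=0)
    return W1, b1, W2, b2

# Symmetric init: every weight has the same value, biases all zero.
W1_sym, b1_sym, W2_sym, b2_sym = train(
    np.full((4, H), 0.1), np.zeros(H), np.full((H, 1), 0.1), np.zeros(1)
)
# Random init for comparison.
W1_rand, b1_rand, W2_rand, b2_rand = train(
    rng.normal(scale=0.5, size=(4, H)), np.zeros(H),
    rng.normal(scale=0.5, size=(H, 1)), np.zeros(1),
)

# Each column of W1 holds the incoming weights of one hidden unit.
sym_units_identical = np.allclose(W1_sym, W1_sym[:, :1])
rand_units_identical = np.allclose(W1_rand, W1_rand[:, :1])
print("symmetric init, hidden units still identical:", sym_units_identical)
print("random init, hidden units identical:", rand_units_identical)
```

In this sketch the symmetric network keeps 8 copies of the same hidden unit after training, so it has the capacity of a single unit, while the randomly initialized one ends up with distinct units.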
Own-Archer7158 t1_iy3mec9 wrote
Note that the minimal loss is reached when the parameters make neural network predictions the closest to the real labels
Before that, the gradient is generally non-zero (except for a very, very unlucky local minimum)
You could look at the case of linear regression with least-squares error as the loss to better understand the underlying optimization problem (in one dimension, it is a quadratic function to minimize, so there is no local minimum other than the global one)
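Here is that one-dimensional case as a short sketch. The data (true slope 3 plus noise), the learning rate, and the starting points are made up. For a model y ≈ w*x, the mean squared error expands to a*w^2 - 2*b*w + c with a = mean(x^2), b = mean(x*y), c = mean(y^2), which is a parabola in w with its single minimum at w = b/a.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                          # hypothetical inputs
y = 3.0 * x + rng.normal(scale=0.1, size=200)     # hypothetical noisy targets

# Coefficients of the quadratic loss in w.
a, b, c = np.mean(x**2), np.mean(x * y), np.mean(y**2)
w_star = b / a                                    # closed-form minimizer

def loss(w):
    return np.mean((w * x - y) ** 2)

def descend(w, lr=0.1, steps=200):
    """Gradient descent on the 1-D loss; dLoss/dw = 2*(a*w - b)."""
    for _ in range(steps):
        w = w - lr * 2.0 * (a * w - b)
    return w

w_from_left = descend(-10.0)
w_from_right = descend(25.0)
print("closed form:", w_star)
print("from -10:", w_from_left, " from 25:", w_from_right)
```

Both starting points end at the same w, as expected for a convex quadratic.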
Ok_Firefighter_2106 t1_iy6t95h wrote
2,3
2: For example, if you use zero values for initialization, then due to the symmetric nature of the NN all neurons become the same, and the multi-layer NN is equivalent to a simple linear regression since the NN fails to break the symmetry. Therefore, if the problem is non-linear, the NN just can't learn.
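A quick way to see how limiting zero initialization can be is to train a small network from all zeros. This sketch assumes a two-layer tanh network, made-up data, and plain full-batch gradient descent; other activations or architectures may behave differently. In this particular setup, only the output bias ever receives a non-zero gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))               # hypothetical inputs
y = np.sin(X[:, :1])                       # hypothetical non-linear target

H = 8                                      # hidden width (arbitrary choice)
W1, b1 = np.zeros((4, H)), np.zeros(H)     # everything starts at zero
W2, b2 = np.zeros((H, 1)), np.zeros(1)
lr = 0.1

for _ in range(200):
    h = np.tanh(X @ W1 + b1)               # tanh(0) = 0, so h is all zeros
    pred = h @ W2 + b2
    d_pred = 2.0 * (pred - y) / len(y)     # dLoss/dpred for mean squared error
    d_h = d_pred @ W2.T * (1.0 - h**2)     # zero because W2 is zero
    W1 = W1 - lr * (X.T @ d_h)
    b1 = b1 - lr * d_h.sum(axis=0)
    W2 = W2 - lr * (h.T @ d_pred)          # zero because h is zero
    b2 = b2 - lr * d_pred.sum(axis=0)      # the only parameter that moves

weights_still_zero = not W1.any() and not W2.any()
print("weights still zero after training:", weights_still_zero)
print("output bias:", b2[0], " mean of targets:", y.mean())
```

Here the network ends up predicting a single constant (the mean of the targets), so it cannot fit a non-linear target at all.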
3: as explained in other answers.
Own-Archer7158 t1_iy3h1oa wrote
3 is the only possible solution