nutpeabutter t1_iy3z9lc wrote

There is indeed a non-zero gradient. However, symmetric initialization introduces a plethora of problems:

  1. The only way to break the symmetry is through the random biases. A fully symmetric network effectively means that each layer acts as though it were a single weight (a 1-input, 1-output layer), so it cannot learn complex functions until the symmetry is broken. Learning is therefore heavily delayed, because the network must first break the symmetry before it can learn a useful function. This explains the plateau at the start (see the sketch after this list).
  2. Similar weights at the start, even once symmetry is broken, lead to poor performance. It is easy to get trapped in local minima when your outputs are constrained by weights that lack sufficient variance; there is a reason weights are typically randomly initialized.
  3. Random weights also allow more "learning pathways" to be established: by pure chance alone, certain combinations of weights will be slightly more correct than others, and the network can exploit this to speed up its learning by adjusting its other weights to support those pathways. Symmetric weights do not have this advantage.
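
Here's a minimal sketch of point 1 (PyTorch; the toy layer sizes and the 0.1/0.01 init values are made up purely for illustration): with constant weights and only the biases randomized, every hidden unit receives nearly the same gradient, so the layer effectively learns as a single unit until the tiny bias differences break the tie.

```python
# Toy demo: compare gradient diversity under symmetric vs. random init.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_net(symmetric: bool) -> nn.Sequential:
    net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
    if symmetric:
        for m in net:
            if isinstance(m, nn.Linear):
                nn.init.constant_(m.weight, 0.1)   # identical weights everywhere
                nn.init.normal_(m.bias, std=0.01)  # only the biases differ
    return net

x, y = torch.randn(32, 4), torch.randn(32, 1)

for symmetric in (True, False):
    net = make_net(symmetric)
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    grad = net[0].weight.grad  # gradient of the first layer's weight matrix
    # Spread between the per-unit gradient rows: near zero when symmetric,
    # meaning all hidden units are being pushed in the same direction.
    print("symmetric" if symmetric else "random   ",
          "row-to-row gradient spread:", grad.std(dim=0).mean().item())
```

The symmetric network prints a row-to-row spread close to zero, which is exactly the "acts like a single weight" behaviour described above.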
6

nutpeabutter t1_iqpxj3a wrote

Kinda frustrating how half of the help posts here are requests for laptops. Like, have they not bothered to do even the tiniest bit of research? At the same price you could get an equivalently specced desktop/server AND an additional laptop, with the added bonus of being able to run long training sessions without needing to interrupt them.

2