
float16 t1_it1mphe wrote

I looked at it. He misspoke and probably meant that batch normalization keeps the preactivations in the region of the activation function's domain (tanh in this case) where the derivative is far from 0, i.e., away from saturation.
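A minimal NumPy sketch of that point (illustrative only, with made-up numbers for the unnormalized preactivations): standardizing the preactivations keeps them near 0, where tanh'(x) = 1 - tanh(x)^2 is far from 0, so gradients don't vanish.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preactivations with a large mean/scale, as might occur without normalization.
z = rng.normal(loc=4.0, scale=3.0, size=10_000)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

# Batch-norm style standardization (ignoring the learned scale/shift parameters).
z_norm = (z - z.mean()) / (z.std() + 1e-5)

print("mean tanh' without normalization:", tanh_grad(z).mean())       # close to 0 -> vanishing gradients
print("mean tanh' with normalization:   ", tanh_grad(z_norm).mean())  # far from 0
```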

Also, "Gaussian" is often used to refer to the standard normal distribution.

Same kind of deal when people say "convolution" but mean "cross-correlation."
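A small NumPy sketch of the difference, with an arbitrary asymmetric kernel so it shows up: true convolution flips the kernel before sliding it, cross-correlation does not (and the latter is what "conv" layers in deep learning libraries actually compute).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])   # asymmetric kernel, so flipping changes the result

conv = np.convolve(x, k, mode="valid")    # kernel is reversed before the sliding dot product
xcorr = np.correlate(x, k, mode="valid")  # kernel slides as-is

print("convolution:      ", conv)    # [ 2.  2.]  (uses the reversed kernel [-1, 0, 1])
print("cross-correlation:", xcorr)   # [-2. -2.]
```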

To answer your question, no, it is not necessary, but good researchers often have a solid math foundation.
