
ClearlyCylindrical t1_iqlqykj wrote

I always thought it was because its derivative is nice to calculate: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
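For example, a quick PyTorch sketch (my check, not part of the original claim) showing autograd recovering the same closed form:

```python
import torch

# Autograd's derivative of the sigmoid matches the closed form s * (1 - s).
x = torch.tensor(0.5, requires_grad=True)
s = torch.sigmoid(x)
(grad,) = torch.autograd.grad(s, x)
print(grad.item())           # ≈ 0.235
print((s * (1 - s)).item())  # ≈ 0.235
```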

14

NeilGirdhar t1_iqooxi3 wrote

Plenty of other sigmoid (s-shaped) functions are differentiable.
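For instance (a non-exhaustive sketch; these are my examples, all rescaled to map into (0, 1)):

```python
import math
import torch

# A few differentiable s-shaped functions, all mapping R into (0, 1).
x = torch.linspace(-4.0, 4.0, steps=9)
curves = {
    "logistic": torch.sigmoid(x),
    "tanh, rescaled": 0.5 * (torch.tanh(x) + 1.0),
    "Gaussian CDF": torch.distributions.Normal(0.0, 1.0).cdf(x),
    "arctan, rescaled": torch.atan(x) / math.pi + 0.5,
}
for name, y in curves.items():
    print(name, y)
```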

3

cthorrez OP t1_iqlr7nr wrote

The derivative of the log of any CDF is also nice: d/dx log CDF(x) = PDF(x)/CDF(x).

Plus we have autograd these days. Complicated derivatives can't hold us back anymore haha.
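For instance, with the standard normal (a quick PyTorch sketch, autograd doing the work):

```python
import torch

# d/dx log CDF(x) = PDF(x) / CDF(x), checked with autograd for the
# standard normal.
normal = torch.distributions.Normal(0.0, 1.0)
x = torch.tensor(0.5, requires_grad=True)

(grad,) = torch.autograd.grad(torch.log(normal.cdf(x)), x)
ratio = torch.exp(normal.log_prob(x)) / normal.cdf(x)
print(grad.item())   # ≈ 0.5092
print(ratio.item())  # ≈ 0.5092
```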

−9

mocny-chlapik t1_iqlt8jo wrote

It's about the speed of computation, not the complexity of the definition. If you need to evaluate the function a million or even a billion times per sample, it makes sense to optimize it.
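It's easy to measure, too. A rough sketch (absolute timings vary by hardware, and this only times the activation itself):

```python
import time
import torch

# Rough microbenchmark (CPU, float32): logistic sigmoid vs. the standard
# normal CDF over a large tensor.
x = torch.randn(10_000_000)
normal = torch.distributions.Normal(0.0, 1.0)

for name, fn in [("sigmoid", torch.sigmoid), ("normal cdf", normal.cdf)]:
    start = time.perf_counter()
    for _ in range(10):
        fn(x)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```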

22

cthorrez OP t1_iqmv564 wrote

I'm not really convinced by this. I bet sigmoid is a little faster, but I highly doubt the difference between a logistic sigmoid and a Gaussian-CDF final activation could even be detected when training a transformer model. The other layers are the main cost.

Also, people run all sorts of experiments that increase cost. A good example is GELU vs. ReLU: GELU adds a Gaussian calculation to every layer, and people still use it.
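For reference, exact (erf-based) GELU is literally x times the standard normal CDF (a quick PyTorch check):

```python
import torch
import torch.nn.functional as F

# Exact GELU is x * Phi(x): one Gaussian CDF per element, in every
# layer that uses it.
x = torch.randn(5)
normal = torch.distributions.Normal(0.0, 1.0)
print(torch.allclose(x * normal.cdf(x), F.gelu(x)))  # True
```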

−1