
cthorrez OP t1_iqmv564 wrote

I'm not really convinced by this. I'd bet the logistic sigmoid is a little faster, but I highly doubt the difference between a logistic sigmoid and a Gaussian sigmoid as the final activation could even be detected when training a transformer model. The other layers are the main cost.
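
If you wanted to sanity-check this, here's a rough microbenchmark sketch, assuming PyTorch; the tensor shape and the `gaussian_sigmoid` helper are my own made-up stand-ins for a final-layer logit output, not anything from the paper:

```python
import math
import time

import torch

def gaussian_sigmoid(x):
    # Gaussian CDF used as a sigmoid-shaped activation:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def time_fn(fn, x, iters=100):
    # Warm up, then time repeated applications of the activation.
    for _ in range(10):
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

# Hypothetical final-layer output: batch of 32, one logit per example.
x = torch.randn(32, 1)
print("logistic sigmoid:", time_fn(torch.sigmoid, x))
print("gaussian sigmoid:", time_fn(gaussian_sigmoid, x))
```

Both should come out so small relative to a forward pass through the attention and MLP blocks that the choice is noise.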

Also, people run all sorts of experiments that increase cost. A good example is GELU vs. ReLU: GELU adds Gaussian calculations to every layer, and people still use it.
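
For reference, the exact GELU is literally x times the Gaussian CDF, and the common tanh approximation exists precisely because the erf was considered costly. A minimal sketch, assuming PyTorch (`gelu_exact` and `gelu_tanh` are my own helper names):

```python
import math

import torch
import torch.nn.functional as F

def gelu_exact(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh approximation, used when erf is considered too slow.
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

x = torch.randn(4, 8)
print(torch.allclose(gelu_exact(x), F.gelu(x)))                                  # exact form
print(torch.allclose(gelu_tanh(x), F.gelu(x, approximate="tanh"), atol=1e-6))    # tanh form
```

So the field already pays a per-layer Gaussian cost by default, which makes it hard to argue a single Gaussian in the final activation is prohibitive.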

−1