Submitted by cthorrez t3_xsq40j in MachineLearning
cthorrez OP t1_iqmv564 wrote
Reply to comment by mocny-chlapik in [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
I'm not really convinced by this. I bet the logistic sigmoid is a little bit faster, but I highly doubt the difference between a logistic sigmoid and a Gaussian sigmoid (the normal CDF) as the final activation could even be detected when training a transformer model. The other layers are the main cost.
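To make the comparison concrete, here is a minimal sketch of the two final activations I mean. The function names are my own for illustration, and this is plain standard-library Python, not how any framework implements them.

```python
import math


def logistic_sigmoid(x: float) -> float:
    """CDF of the standard logistic distribution: 1 / (1 + exp(-x))."""
    # Note: math.exp can overflow for very negative x; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-x))


def gaussian_sigmoid(x: float) -> float:
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    # Both map the real line to (0, 1) and cross 0.5 at x = 0.
    for x in (-2.0, 0.0, 2.0):
        print(x, logistic_sigmoid(x), gaussian_sigmoid(x))
```

Both are one scalar function applied to a single output, which is why I'd expect any cost difference to be lost next to the rest of the network.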
Also, people run all sorts of experiments that increase cost. A good example is GELU vs. ReLU: GELU adds Gaussian calculations to every layer, and people still use it.
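For reference, a rough sketch of the two hidden-layer activations, using the exact erf-based form of GELU (x times the standard normal CDF). The helper names are illustrative, and real frameworks often use a vectorized or approximate version instead.

```python
import math


def relu(x: float) -> float:
    """ReLU: a single comparison per element."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


if __name__ == "__main__":
    # GELU needs an erf evaluation per element; ReLU does not.
    for x in (-1.0, 0.0, 1.0):
        print(x, relu(x), gelu(x))
```

So the Gaussian CDF is already being evaluated at every hidden unit in models that use GELU, not just once at the output.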