percevalw t1_iqn20u4 wrote
Reply to [D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez
The use of log loss is tied to the maximum entropy principle, which says the model should make as few assumptions as possible about the true distribution of the data beyond what the observations impose. For example, if all you know is that your problem has two classes, the model should assume nothing further. For binary classification, the conditional distribution that falls out of this principle is the sigmoid (logistic) function, and fitting it by maximum likelihood is exactly minimizing log loss. You can learn more in this short article: https://github.com/WinVector/Examples/raw/main/dfiles/LogisticRegressionMaxEnt.pdf
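
As a rough sketch of how that derivation goes (the notation below is mine, not taken from the linked article): maximizing conditional entropy subject to matching the empirical feature–label moments gives an exponential-family form, which for two classes reduces to the sigmoid, and whose maximum-likelihood objective is the log loss.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}

% Sketch: maximize conditional entropy subject to matching the empirical
% feature--label moments; the Lagrange multipliers w become the weights.
\begin{align}
\max_{p}\;& -\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, \log p(y \mid x) \\
\text{s.t.}\;& \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, y\, x_j
  \;=\; \sum_{x,y} \tilde{p}(x,y)\, y\, x_j \quad \forall j .
\end{align}

% The stationary point has an exponential (softmax) form,
\begin{equation}
p(y \mid x) \;=\; \frac{\exp\left(y\, w^{\top} x\right)}
                       {\sum_{y'} \exp\left(y'\, w^{\top} x\right)} ,
\end{equation}

% which for two classes $y \in \{0, 1\}$ collapses to the sigmoid,
\begin{equation}
p(y = 1 \mid x) \;=\; \frac{1}{1 + \exp\left(-w^{\top} x\right)} ,
\end{equation}

% and fitting w by maximum likelihood minimizes exactly the log loss.

\end{document}
```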