Submitted by Lugi t3_xt01bk in MachineLearning
The equation of α-balanced focal loss (binary in this case for simplicity) is given by:

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)

where p_t = p for class 1 and 1 - p for class 0, α_t = α for class 1 and 1 - α for class 0, and the paper uses α = 0.25, γ = 2.
What puzzles me is that the weighting used here seems to be the opposite of what is intuitive for imbalanced datasets: normally you would scale the loss of class 1 (the minority - foreground objects in the case of object detection) higher than class 0 (the majority - background). What happens here instead is that we scale class 1 by 0.25 and class 0 by 0.75.
Is this behavior explained anywhere? I don't think I have the foreground/background labels mixed up - I've checked multiple implementations as well as the original paper. Or am I missing some crucial detail?
Paper for reference: https://arxiv.org/abs/1708.02002
you-get-an-upvote t1_iqnunxf wrote
The paper says α may be set by inverse class frequency. Under that scheme class 1 would be scaled by 4 (i.e. 1/0.25) and class 0 by 1.33 (i.e. 1/0.75).
But also I want to take this moment to talk about focal loss.
The point of focal loss really isn't downweighting common classes. Note that the original definition of focal loss in the paper, FL(p_t) = -(1 - p_t)^γ log(p_t), doesn't use α at all. The formula you give is the "α-balanced variant of focal loss", which the authors "adopt in [their] experiments as it yields slightly improved accuracy over the non-α-balanced form".
What focal loss does do is decrease the importance of "easy" examples in the loss -- that is, examples the model already classifies confidently and correctly. When datasets are imbalanced, the common classes tend to be "easy" in this sense.
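To make that concrete, here is a minimal sketch of binary focal loss in plain Python (the function name and demo values are mine, not from the paper):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=None):
    """Binary focal loss (Lin et al. 2017) for one example.

    p is the predicted probability of class 1, y the true label (0 or 1).
    alpha=None gives the original, non-α-balanced form.
    """
    p_t = p if y == 1 else 1.0 - p                 # prob. assigned to the true class
    loss = -((1.0 - p_t) ** gamma) * math.log(p_t)
    if alpha is not None:                          # α-balanced variant
        loss *= alpha if y == 1 else 1.0 - alpha
    return loss

# An easy example (confident and correct) is cut by (1 - p_t)^γ = 0.1² = 0.01,
# while a hard one keeps most of its cross-entropy loss:
print(focal_loss(0.9, 1))  # ≈ 0.00105, vs cross-entropy -log(0.9) ≈ 0.105
print(focal_loss(0.1, 1))  # ≈ 1.87,    vs cross-entropy -log(0.1) ≈ 2.303
```

The γ exponent controls how aggressively easy examples are suppressed; γ = 0 recovers plain cross-entropy.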
For example, consider a dataset that is 99% classA and 1% classB. A trivial model will predict every datapoint has a 99% chance of being classA, which results in a very low loss for classA datapoints and a very high loss for classB datapoints.
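Plugging those numbers into the sketch above (with γ = 2): a classA point predicted at p_t = 0.99 has its cross-entropy loss of -log(0.99) ≈ 0.01 scaled by (1 - 0.99)² = 10⁻⁴, down to about 10⁻⁶, while a classB point at p_t = 0.01 keeps (1 - 0.01)² ≈ 0.98 of its -log(0.01) ≈ 4.6 loss.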
Note, though, that these are not the same thing, since the more common class doesn't have to be the easier one. Suppose I train a model on CIFAR10 but add an additional "image is a solid color" class. Even if this extra class has only 10% of the datapoints of the other classes, it's so much easier to classify than the others that focal loss will assign it lower weight.
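In the sketch above that shows up directly: a solid-color image the model predicts at p_t = 0.999 gets weight (1 - 0.999)² = 10⁻⁶ no matter how rare the class is, since class frequency never enters the non-α-balanced loss at all.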