you-get-an-upvote t1_iqnunxf wrote on October 1, 2022 at 7:47 PM

The alpha used in the paper is the inverse of the frequency of the class. So class1 is scaled by 4 (i.e. 1 / 0.25) and class2 is scaled by 1.33 (1/0.75).
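As a quick check of that arithmetic, here is a minimal sketch of inverse-frequency weighting. The dictionary names are made up for illustration, and the frequencies are the 0.25 / 0.75 split from the example.

```python
# Inverse-frequency class weights for the 0.25 / 0.75 split discussed above.
# The names class_freq and inv_freq_weight are illustrative only.
class_freq = {"class1": 0.25, "class2": 0.75}

# weight = 1 / frequency, so the rarer class gets the larger weight
inv_freq_weight = {name: 1.0 / freq for name, freq in class_freq.items()}

for name, weight in inv_freq_weight.items():
    print(f"{name}: {weight:.2f}")
```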

But also I want to take this moment to talk about focal loss.

The point of focal loss really isn't downweighting common classes. Note that the original definition of focal loss in the paper doesn't use α. The formula you give is the "α-balanced variant of focal loss" which the authors "adopt in [their] experiments as it yields slightly improved accuracy over the non-α-balanced form".
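To make the two forms concrete, here is a rough sketch for a single example, where p_t is the probability the model assigns to the true class. The function names and the default gamma of 2 are my assumptions for illustration, not something taken from the thread.

```python
import math

def cross_entropy(p_t: float) -> float:
    """Plain cross entropy for one example: -log(p_t)."""
    return -math.log(p_t)

def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Focal loss without alpha: -(1 - p_t)^gamma * log(p_t)."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

def alpha_balanced_focal_loss(p_t: float, alpha_t: float, gamma: float = 2.0) -> float:
    """Alpha-balanced variant: the same loss scaled by a per-class weight alpha_t."""
    return alpha_t * focal_loss(p_t, gamma)
```

With gamma = 0 the modulating factor is 1, so focal_loss reduces to ordinary cross entropy; alpha only rescales whatever the focal term produces.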

What focal loss does do is decrease the contribution of "easy" examples to the loss; that is, it down-weights examples that the model already classifies correctly with high confidence. When datasets are imbalanced, common classes tend to be "easy" in this sense.

For example, consider a dataset that is 99% classA and 1% classB. A trivial model will predict that every datapoint has a 99% chance of being classA, which results in a very low loss for classA datapoints and a very high loss for classB datapoints.
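Here is a small sketch of that example with numbers plugged in. It assumes 100 datapoints (99 classA, 1 classB), a focusing parameter gamma of 2, and no alpha; all variable names are illustrative.

```python
import math

GAMMA = 2.0        # assumed focusing parameter
N_A, N_B = 99, 1   # 99% classA, 1% classB
P_A = 0.99         # the trivial model predicts P(classA) = 0.99 for everything

def ce(p_t: float) -> float:
    """Cross entropy given the probability assigned to the true class."""
    return -math.log(p_t)

def focal(p_t: float, gamma: float = GAMMA) -> float:
    """Focal loss without alpha."""
    return (1.0 - p_t) ** gamma * ce(p_t)

# Total loss contributed by each class under both losses.
ce_a, ce_b = N_A * ce(P_A), N_B * ce(1.0 - P_A)
fl_a, fl_b = N_A * focal(P_A), N_B * focal(1.0 - P_A)

# Share of the total loss that comes from the easy, common class.
ce_share_a = ce_a / (ce_a + ce_b)
fl_share_a = fl_a / (fl_a + fl_b)

print(f"classA share of cross-entropy loss: {ce_share_a:.4%}")
print(f"classA share of focal loss:         {fl_share_a:.4%}")
```

Under these assumptions the common class still makes up roughly 18% of the cross-entropy total, but well under 0.01% of the focal total.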

Note, though, that being common and being easy are not the same thing, since the more common class doesn't have to be the easier one. Suppose I train a model on CIFAR10 but add an additional "image is a solid color" class. Even if this extra class has only 10% of the datapoints of the other classes, it's so easy to classify compared to the other classes that focal loss will assign it lower weight.
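A tiny sketch of why class frequency doesn't enter into it: the modulating factor depends only on p_t, the probability given to the true class. The two probabilities below are invented for illustration (a near-certain solid-color image versus a shakier ordinary CIFAR10 image), and gamma = 2 is assumed.

```python
def modulating_factor(p_t: float, gamma: float = 2.0) -> float:
    """The (1 - p_t)^gamma term that focal loss multiplies onto cross entropy."""
    return (1.0 - p_t) ** gamma

# Assumed confidences, not measured values:
rare_but_easy = modulating_factor(0.999)   # solid-color image, rare class
common_but_hard = modulating_factor(0.6)   # ordinary image, common class

print(f"rare but easy:   {rare_but_easy:.2e}")
print(f"common but hard: {common_but_hard:.2e}")
```

The rare class ends up with the far smaller factor simply because the model is more confident on it.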

killver t1_iqo7jqh wrote on October 1, 2022 at 9:21 PM

But that's the opposite of how most implementations do it, as OP mentions: http://pytorch.org/vision/stable/_modules/torchvision/ops/focal_loss.html#sigmoid_focal_loss
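For reference, this is a plain-Python sketch of how I read the alpha weighting in that linked code: alpha goes on the positive class and 1 - alpha on the negative class, with alpha defaulting to 0.25. The helper name alpha_weight is mine, not torchvision's, and the real function works on tensors rather than single floats.

```python
def alpha_weight(target: float, alpha: float = 0.25) -> float:
    """Per-example weight: alpha when target == 1, (1 - alpha) when target == 0."""
    return alpha * target + (1.0 - alpha) * (1.0 - target)

positive_weight = alpha_weight(1.0)  # weight on the positive (usually rare) class
negative_weight = alpha_weight(0.0)  # weight on the negative (usually common) class

print(positive_weight, negative_weight)
```

So with alpha = 0.25 the positive class is scaled by 0.25 and the negative class by 0.75, which is not an inverse-frequency weighting.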

Or am I getting it wrong?

Lugi OP t1_iqoa9pp wrote on October 1, 2022 at 9:41 PM

>The alpha used in the paper is the inverse of the frequency of the class. So class1 is scaled by 4 (i.e. 1 / 0.25) and class2 is scaled by 1.33 (1/0.75).

They say it CAN be set like that, but they explicitly set it to 0.25. This is why I am confused: they put that statement in and then did the complete opposite.

chatterbox272 t1_iqp67eq wrote on October 2, 2022 at 1:55 AM

It is most likely because the focal term ends up over-emphasizing the rare class term for their task. The focal loss up-weights hard samples (most of which will usually be the rare/object class) and down-weights easy samples (background/common class). The alpha term is therefore being set to re-adjust the background class back up, so it doesn't become too easy to ignore. They inherit the nomenclature from cross entropy, but they use the term in a different way and are clear as mud about it in the paper.
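A back-of-the-envelope sketch of that rebalancing, with entirely made-up numbers: 100,000 background anchors that the model already scores at 0.01 foreground probability, 100 foreground anchors scored at 0.3, and gamma = 2. The function below is a simplified per-anchor version that takes probabilities rather than logits; its name and all constants are assumptions for illustration.

```python
import math

def focal_loss_per_anchor(p: float, target: int, alpha: float, gamma: float = 2.0) -> float:
    """Alpha-balanced focal loss for one anchor, given predicted foreground probability p."""
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    a_t = alpha if target == 1 else 1.0 - alpha  # alpha on foreground, 1 - alpha on background
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

N_BG, N_FG = 100_000, 100  # assumed anchor counts
P_BG, P_FG = 0.01, 0.3     # assumed predicted foreground probabilities

def fg_to_bg_ratio(alpha: float) -> float:
    """Total foreground loss divided by total background loss."""
    fg = N_FG * focal_loss_per_anchor(P_FG, 1, alpha)
    bg = N_BG * focal_loss_per_anchor(P_BG, 0, alpha)
    return fg / bg

ratio_equal_alpha = fg_to_bg_ratio(0.5)   # both classes weighted the same
ratio_low_alpha = fg_to_bg_ratio(0.25)    # foreground 0.25, background 0.75

print(f"fg/bg loss ratio, alpha=0.5:  {ratio_equal_alpha:.1f}")
print(f"fg/bg loss ratio, alpha=0.25: {ratio_low_alpha:.1f}")
```

Under these assumptions the foreground loss dominates even though background anchors outnumber foreground 1000 to 1, and moving alpha from 0.5 to 0.25 cuts the foreground-to-background ratio by a factor of three, which is the "re-adjust the background class back up" effect.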

I_draw_boxes t1_iqvuh8g wrote on October 3, 2022 at 2:03 PM

>The alpha term is therefore being set to re-adjust the background class back up, so it doesn't become too easy to ignore.

This is it. The background in RetinaNet far exceeds the foreground, so the network's default prediction will be background, which generates very little loss per anchor in their formulation. Focal loss without alpha is symmetrical, but the targets and behavior of RetinaNet are not.

Alpha might be intended to bring up the loss for common negative examples to keep it in balance with foreground loss. It might also be intended to bring up the loss for false positives which are even more rare than foreground.