
Lugi OP t1_iqoa9pp wrote

>The alpha used in the paper is the inverse of the frequency of the class. So class1 is scaled by 4 (i.e. 1 / 0.25) and class2 is scaled by 1.33 (1/0.75).

They say it CAN be set like that, but they explicitly set it to 0.25. This is why I'm confused: they state the inverse-frequency interpretation and then do the complete opposite.

3

chatterbox272 t1_iqp67eq wrote

It is most likely because the focal term ends up over-emphasizing the rare class for their task. The focal loss up-weights hard samples (most of which will usually be the rare/object class) and down-weights easy samples (background/common class). The alpha term is therefore being set to re-adjust the background class back up, so it doesn't become too easy to ignore. They inherit the alpha nomenclature from balanced cross-entropy, but they use the term in a different way and are clear as mud about it in the paper.
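
If it helps, here's a minimal sketch of the alpha-weighted loss as the paper writes it, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name and the PyTorch framing are mine, not from their code release:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, Lin et al. 2017. targets are 1 for foreground, 0 for background.

    alpha weights the foreground term, (1 - alpha) the background term,
    and (1 - p_t)^gamma down-weights easy examples.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # 0.25 for fg, 0.75 for bg
    # RetinaNet normalizes by the number of foreground anchors; mean() is a simplification here.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

Note that with alpha = 0.25 the *background* term actually gets the larger weight (0.75), which is the opposite of what the inverse-frequency reading would suggest.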

6

I_draw_boxes t1_iqvuh8g wrote

>The alpha term is therefore being set to re-adjust the background class back up, so it doesn't become too easy to ignore.

This is it. The background in RetinaNet far exceeds foreground, so the default prediction of the network will be background, which generates very little loss per anchor in their formulation. Focal loss without alpha is symmetrical, but the targets and behavior of RetinaNet are not.

Alpha might be intended to bring up the loss for common negative examples to keep it in balance with foreground loss. It might also be intended to bring up the loss for false positives, which are even rarer than foreground.
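
To put rough, purely illustrative numbers on it (not from the paper): with gamma = 2 an easy background anchor keeps only ~1e-4 of its loss while a hard foreground anchor keeps ~0.81, and alpha = 0.25 then tilts things back toward background by weighting it 0.75 vs 0.25.

```python
import torch

gamma, alpha = 2.0, 0.25

# One easy background anchor (p_t = 0.99) and one hard foreground anchor (p_t = 0.10).
p_t = torch.tensor([0.99, 0.10])
is_fg = torch.tensor([0.0, 1.0])

focal_factor = (1 - p_t) ** gamma                    # [1e-4, 0.81]: background nearly vanishes
alpha_t = alpha * is_fg + (1 - alpha) * (1 - is_fg)  # [0.75, 0.25]: alpha favors background

print(focal_factor * alpha_t)                        # [7.5e-5, 0.2025]
```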

2