Dartagnjan OP t1_j103e5a wrote
Reply to comment by trajo123 in [D] Techniques to optimize a model when the loss over the training dataset has a Power Law type curve. by Dartagnjan
Yes, I already have batch_size=1. I am looking into sharding the model across multiple GPUs now. In my case, not being able to predict on the 1% of super hard examples means that those examples have features that the model has not learned to understand yet. The labeling is very close to perfect, with mathematically proven error bounds...
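For what it's worth, what I have in mind is plain layer-wise model parallelism, something like this rough two-GPU sketch (layer sizes and device names are placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class ShardedNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First chunk of layers lives on GPU 0, second chunk on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        # Move activations between devices by hand at the split point.
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))
```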
> focal loss, hard-example mining
I think these are exactly the keywords that I was missing in my search.
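From a first look, focal loss seems straightforward to drop in; a minimal binary-classification sketch in PyTorch (gamma and alpha here are just the paper's default values, not tuned for my problem):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # targets are 0/1 floats with the same shape as logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t: the model's probability for the true class of each example.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy examples (p_t near 1) so the
    # hard examples dominate the gradient; alpha balances the classes.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```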
dumbmachines t1_j133fcs wrote
If focal loss looks interesting, check out PolyLoss, which is a generalization of the focal loss idea.
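Roughly, the simplest Poly-1 variant just adds one extra polynomial term on top of cross-entropy; a minimal sketch, with epsilon=1.0 as a placeholder rather than a tuned value:

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, targets, epsilon=1.0):
    # Per-example cross-entropy over class logits (targets are class indices).
    ce = F.cross_entropy(logits, targets, reduction="none")
    # p_t: softmax probability assigned to the true class of each example.
    p_t = torch.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    # Poly-1: cross-entropy plus one polynomial correction term epsilon * (1 - p_t).
    return (ce + epsilon * (1 - p_t)).mean()
```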