Dartagnjan OP t1_j103e5a wrote
Reply to comment by trajo123 in [D] Techniques to optimize a model when the loss over the training dataset has a Power Law type curve. by Dartagnjan
Yes, I already have batch_size=1. I am looking into sharding the model across multiple GPUs now. In my case, not being able to predict on the 1% of super hard examples means that those examples have features that the model has not learned to understand yet. The labeling is very close to perfect, with mathematically proven error bounds...
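For what it's worth, what I have in mind is plain layer-wise model parallelism, something like this rough two-GPU sketch (layer sizes and device names are placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class ShardedNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First chunk of layers lives on GPU 0, second chunk on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        # Move activations between devices by hand at the split point.
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))
```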
> focal loss, hard-example mining
I think these are exactly the keywords that I was missing in my search.
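From a first look, focal loss seems straightforward to drop in; a minimal binary-classification sketch in PyTorch (gamma and alpha here are just the paper's default values, not tuned for my problem):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # targets are 0/1 floats with the same shape as logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t: the model's probability for the true class of each example.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy examples (p_t near 1) so the
    # hard examples dominate the gradient; alpha balances the classes.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```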
dumbmachines t1_j133fcs wrote
If focal loss looks interesting, check out PolyLoss, which is a generalization of the focal loss idea.
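Roughly, the simplest Poly-1 variant just adds one extra polynomial term on top of cross-entropy; a minimal sketch, with epsilon=1.0 as a placeholder rather than a tuned value:

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, targets, epsilon=1.0):
    # Per-example cross-entropy over class logits (targets are class indices).
    ce = F.cross_entropy(logits, targets, reduction="none")
    # p_t: softmax probability assigned to the true class of each example.
    p_t = torch.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    # Poly-1: cross-entropy plus one polynomial correction term epsilon * (1 - p_t).
    return (ce + epsilon * (1 - p_t)).mean()
```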