madhatter09 t1_j0jtxaq wrote

There are several papers on this idea - the best one is probably On Calibration of Modern Neural Networks by Guo et al. The gist is that you want your softmax output to match the probability of your prediction being correct. For your architecture they do this through a post-hoc method called temperature scaling: divide the logits by a single scalar T, fit on a held-out validation set, before applying softmax. Why this works is a more involved topic, but a good starting point is understanding the consequences of training with cross entropy on hard labels (1s and 0s) versus soft labels (values in between).
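
A minimal sketch of what that looks like in PyTorch (assuming you already have `logits` and `labels` from a validation set; the function name `fit_temperature` is just illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

def fit_temperature(logits, labels, max_iter=50):
    # Single scalar temperature, the only parameter being fit;
    # the model's weights stay frozen.
    temperature = nn.Parameter(torch.ones(1))
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        # Scale logits by 1/T and minimize NLL on the validation set.
        loss = nll(logits / temperature, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.item()

# At inference, divide logits by the fitted T before softmax:
# probs = torch.softmax(logits / T, dim=1)
```

Since T is a single scalar applied uniformly, it can't change the argmax - accuracy stays identical, only the confidence gets rescaled.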

I think then going into OOD, as the others suggest, would be more fruitful. The whole area of distribution shift, and the extremes of OOD, gets very murky and detached from what happens in practice, but ultimately the goal is to have the ability to know that a mismatch between input and model is happening, rather than just seeing low confidence.
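
For reference, the simplest OOD baseline just thresholds the max softmax probability (Hendrycks & Gimpel's MSP baseline) - which is exactly the "low confidence" signal the point above argues isn't enough on its own. A quick sketch, with an illustrative threshold:

```python
import torch

def is_ood(logits, threshold=0.9):
    # Flag an input as OOD when the max softmax probability falls
    # below a threshold chosen on in-distribution validation data.
    probs = torch.softmax(logits, dim=1)
    max_prob, _ = probs.max(dim=1)
    return max_prob < threshold
```

The known failure mode is that calibrated-looking confidence can still be high on inputs far from the training distribution, which is why the field goes beyond this toward methods that try to detect the input/model mismatch directly.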
