
visarga t1_j0k46hz wrote

It's a hard problem; nobody has a definitive solution. From my lectures and experience:

  • interval calibration (easy)

  • temperature scaling (easy; see the sketch after this list)

  • ensembling (expensive)

  • Monte Carlo dropout (not great)

  • using prior networks or auxiliary networks (for OOD detection)

  • error correcting output codes (ECOC)

  • conformal prediction (slightly different task than confidence estimation)
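A minimal sketch of the temperature scaling step (PyTorch; `val_logits` / `val_labels` are placeholder names for held-out validation outputs from an already trained classifier, and the LBFGS settings are just illustrative):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Find the scalar T > 0 that minimises NLL of softmax(logits / T) on validation data."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: divide test logits by the fitted temperature before the softmax.
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / T, dim=1)
```

It fits a single scalar, so the argmax prediction is unchanged; only the sharpness of the softmax, and hence the reported confidence, is recalibrated.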

Here's my Zeta-Alpha confidence estimation paper feed.

6

vwings t1_j0l5778 wrote

A hard problem indeed. The methods in your list apply in different settings. Deep Ensembles and MC Dropout don't require a calibration set. The prior networks (I love this paper) assume that OOD samples are available during training. Conformal prediction assumes the availability of a calibration set that follows the distribution of future data... For the other methods, I would have to check...
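To make the calibration-set requirement concrete, here is a rough sketch of split conformal prediction for a classifier (`cal_probs`, `cal_labels`, `test_probs` are placeholder names for softmax outputs on a held-out calibration split and on new test points):

```python
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1) -> float:
    """Nonconformity score = 1 - softmax probability of the true class on the calibration split."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level.
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(level, 1.0), method="higher")

def prediction_set(test_probs: np.ndarray, qhat: float) -> np.ndarray:
    """Boolean mask of classes in each prediction set; large sets signal low confidence."""
    return (1.0 - test_probs) <= qhat

# qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
# sets = prediction_set(test_probs, qhat)   # ~90% coverage under exchangeability
```

The coverage guarantee only holds under exchangeability, i.e. exactly the assumption I mentioned: the calibration set has to follow the distribution of future data.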

2

zeyus t1_j0kfuu5 wrote

Quick question. Wouldn't a simple solution be to include a 'neither'/'other' output class?

Given a network that should classify an image as a dog or a cat: in reality a lot of use cases actually want a multi-class prediction rather than a binary one, because a picture of a monkey should be neither a dog nor a cat. Just on a hunch, I would guess the performance goes down significantly, and it obviously requires more training data.

1

trajo123 t1_j0kullx wrote

Yes, that's an option, but you have absolutely no guarantee about its ability to produce anything meaningful. Which images do you introduce in the "other" class? There are infinitely more images falling into the other category than there are cat-or-dog images. For any training set you come up with for the "other" class, the model can still be tested with an image totally different from your training set, and the model output will have no reason whatsoever to favour "other" for the new image.

4

visarga t1_j0m7kn4 wrote

I can confirm this. I did NER, and most tokens are not named entities, so they are "other". It's really hard to define what "other" means; even with lots of text the model is unsure. No matter how much "other" data I provided, I couldn't train a negative class properly.

2

zeyus t1_j0kxp2n wrote

True, the thought did occur to me, but I figured you could train the other category with a diverse set of animals and also people, nature, cars, landscapes, etc. While there is a far larger (still infinite) set of "non-dog" or "non-cat" images, it must be possible to learn features that absolutely don't indicate a dog or cat... I don't think it's the most effective method, perhaps... though it would be interesting to give it a go; maybe after my exams I'll try...

I can't shake the feeling that it might somehow be informative at the classification layer, either for reducing the confidence of the other categories or weighting them somehow.

1

trajo123 t1_j0l5hwj wrote

You will get some results, for sure. Depending on your application, it may even be good enough. But as a general probability that an image is something other than a cat or a dog, not really.
As other commenters have mentioned, the general problem is known as OOD (out-of-distribution) sample detection. There are deep learning models which model probabilities explicitly and can in principle be used for OOD sample detection - Variational Autoencoders (VAEs). The original formulation performs poorly in practice at OOD sample detection, but there is work addressing some of the shortcomings, for instance "Detecting Out-of-distribution Samples via Variational Auto-encoder with Reliable Uncertainty Estimation". But with VAEs, things get very mathematical, very fast.
Coming back to your initial question: no, softmax is not appropriate for "confidence", but this is an open problem in Deep Learning.
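To make the VAE route a bit more concrete, here is a rough sketch of the naive formulation (the one that performs poorly): train a small VAE on in-distribution images only, then flag inputs whose negative ELBO is unusually high. The architecture, sizes and threshold rule below are illustrative assumptions, not anyone's published setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal MLP VAE for flattened images with pixel values scaled to [0, 1]."""
    def __init__(self, in_dim=3 * 64 * 64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
        return self.decoder(z), mu, logvar

def negative_elbo(model, x):
    """Per-sample negative ELBO; its mean is also the VAE training loss."""
    recon, mu, logvar = model(x)
    rec = F.binary_cross_entropy(recon, x.flatten(1), reduction="none").sum(dim=1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return rec + kl

# After training on cat/dog images only, pick a threshold from the
# in-distribution scores (e.g. their 95th percentile) and flag anything above it:
# scores = negative_elbo(trained_vae, batch)   # batch: [B, 3, 64, 64] in [0, 1]
# is_ood = scores > threshold
```

One known catch with this naive score is that deep generative models can assign high likelihood to some OOD inputs, which is part of why the plain formulation underperforms and why the follow-up work exists.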

1

visarga t1_j0m7wj7 wrote

How about a bronze statue of a dog, a caricature of a cat, a fantasy image that is hard to classify, etc.? Are they "other"?

1