Submitted by neuralbeans t3_10puvih in deeplearning

I'd like to train a neural network where the softmax output has a minimum possible probability, i.e. during training none of the probabilities should ever go below this minimum. Basically, I want to keep the logits from becoming too different from each other so that no output category is ever completely excluded from a prediction, a sort of smoothing. What's the best way to do this during training?

6

Comments


like_a_tensor t1_j6mcv1v wrote

I'm not sure how to fix a minimum probability, but you could try softmax with a high temperature.
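Not the commenter's code, just a minimal sketch of what that temperature trick looks like (the logits and temperature here are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, -2.0])
temperature = 5.0  # higher temperature -> flatter distribution

plain = F.softmax(logits, dim=-1)                   # roughly [0.95, 0.05, 0.00]
smoothed = F.softmax(logits / temperature, dim=-1)  # roughly [0.54, 0.30, 0.16]
```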

2

Lankyie t1_j6mf6pt wrote

max(softmax output, lowest accepted probability)
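One way to read this (my interpretation; the renormalisation step is added so the result is still a valid distribution):

```python
import torch
import torch.nn.functional as F

def clamped_softmax(logits, floor=0.05):
    probs = F.softmax(logits, dim=-1)
    probs = torch.clamp(probs, min=floor)           # max(softmax, lowest accepted probability)
    return probs / probs.sum(dim=-1, keepdim=True)  # renormalise; can dip slightly below the floor again
```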

1

FastestLearner t1_j6mhjd2 wrote

Use a composite loss, i.e. add extra terms to the loss function so that the optimizer forces the logits to stay within a fixed range.

For example, if current min logit = m and allowed minimum = u, current max logit = n and allowed maximum = v, then the following loss function should help:

Overall loss = CrossEntropy loss + lambda1 * max(u - m, 0) + lambda2 * max(n - v, 0)

The max terms ensure that no loss is added when the logits are all within the allowed range. Use lambda1 and lambda2 to scale each term so that it roughly matches the CE loss in strength.
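A minimal PyTorch sketch of that composite loss (the bounds and lambda values are placeholders):

```python
import torch
import torch.nn.functional as F

def range_penalised_loss(logits, targets, u=-5.0, v=5.0, lambda1=0.1, lambda2=0.1):
    # u/v: allowed min/max logit; lambda1/lambda2: penalty weights (placeholder values)
    ce = F.cross_entropy(logits, targets)
    m, n = logits.min(), logits.max()  # current min and max logits
    penalty = lambda1 * torch.clamp(u - m, min=0) + lambda2 * torch.clamp(n - v, min=0)
    return ce + penalty
```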

5

emilrocks888 t1_j6mjf7m wrote

I would scale the logits before the softmax, like it's done in self-attention. That scaling in self-attention is actually there to make the final distribution of the attention weights smooth.
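Roughly what that looks like (the sqrt scaling mirrors the 1/sqrt(d_k) factor in attention; any fixed divisor greater than 1 smooths the output):

```python
import math
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)                # (batch, num_classes), dummy values
scale = math.sqrt(logits.size(-1))         # analogous to sqrt(d_k) in attention
probs = F.softmax(logits / scale, dim=-1)  # flatter than F.softmax(logits, dim=-1)
```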

1

chatterbox272 t1_j6myph4 wrote

If the goal is to keep all predictions above a floor, the easiest way is to make the activation into floor + (1 - floor * num_logits) * softmax(logits). This doesn't have any material impact on the model, but it imposes a floor.

If the goal is to actually change something about how the predictions are made, then adding a floor isn't the solution. You could modify the activation function in some other way (e.g. by scaling the logits, normalising them, etc.), or you could impose a loss penalty on the difference between the logits or on the final predictions.
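A sketch of the floor activation from the first paragraph (the floor value is arbitrary):

```python
import torch
import torch.nn.functional as F

def floored_softmax(logits, floor=0.01):
    # floor + (1 - floor * num_logits) * softmax(logits); needs floor * num_logits < 1
    k = logits.size(-1)
    return floor + (1 - floor * k) * F.softmax(logits, dim=-1)

# Outputs still sum to 1 and never drop below the floor.
out = floored_softmax(torch.randn(2, 5), floor=0.1)
assert torch.allclose(out.sum(dim=-1), torch.ones(2))
assert (out >= 0.1 - 1e-6).all()
```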

1

nutpeabutter t1_j6n2eaf wrote

Taking a leaf out of RL, you can add an additional entropy loss.

Alternatively, clip the logits but apply a straight-through estimator (STE, i.e. copy the gradients) in the backward pass.
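Both ideas sketched in PyTorch (the entropy weight and clip range are placeholders):

```python
import torch
import torch.nn.functional as F

def entropy_bonus_loss(logits, targets, beta=0.01):
    # Cross-entropy minus a weighted entropy bonus, as in many RL policy-gradient losses.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return F.cross_entropy(logits, targets) - beta * entropy

def ste_clip(logits, lo=-5.0, hi=5.0):
    # Forward pass uses the clipped logits; gradients flow to the raw logits unchanged.
    clipped = logits.clamp(lo, hi)
    return logits + (clipped - logits).detach()
```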

1

chatterbox272 t1_j6n3vx6 wrote

My proposed function does that. Say you have two outputs and don't want either to go below 0.25. The floors already add up to 0.5, so you rescale the softmax to add up to the remaining 0.5, giving a total of 1 and a valid distribution.

2

No_Cryptographer9806 t1_j6nfqhq wrote

I'm curious why you want to do that. You can always post-process the logits, but forcing the network to learn it will harm the underlying representation imo.

1