Submitted by neuralbeans t3_10puvih in deeplearning

I'd like to train a neural network where the softmax output has a minimum possible probability, i.e. during training none of the probabilities should ever go below this minimum. Basically, I want to keep the logits from becoming too different from each other so that no output category is ever completely excluded from a prediction, a sort of smoothing. What's the best way to do this during training?

6

Comments


like_a_tensor t1_j6mcv1v wrote

I'm not sure how to fix a minimum probability, but you could try softmax with a high temperature.
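Not the commenter's code, just a minimal sketch of what that temperature trick looks like (the logits and temperature here are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, -2.0])
temperature = 5.0  # higher temperature -> flatter distribution

plain = F.softmax(logits, dim=-1)                   # roughly [0.95, 0.05, 0.00]
smoothed = F.softmax(logits / temperature, dim=-1)  # roughly [0.54, 0.30, 0.16]
```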

2

Lankyie t1_j6mf6pt wrote

max(softmax output, lowest accepted probability)
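One way to read this (my interpretation; the renormalisation step is added so the result is still a valid distribution):

```python
import torch
import torch.nn.functional as F

def clamped_softmax(logits, floor=0.05):
    probs = F.softmax(logits, dim=-1)
    probs = torch.clamp(probs, min=floor)           # max(softmax, lowest accepted probability)
    return probs / probs.sum(dim=-1, keepdim=True)  # renormalise; can dip slightly below the floor again
```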

1

FastestLearner t1_j6mhjd2 wrote

Use a composite loss, i.e. add extra terms to the loss function so that the optimizer forces the logits to stay within a fixed range.

For example, if current min logit = m and allowed minimum = u, current max logit = n and allowed maximum = v, then the following loss function should help:

Overall loss = CrossEntropy loss + lambda1 * max(u - m, 0) + lambda2 * max(n - v, 0)

The max terms ensure that no loss is added when the logits are all within the allowed range. Use lambda1 and lambda2 to scale each term so that it roughly matches the CE loss in strength.
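A minimal PyTorch sketch of that composite loss (the bounds and lambda values are placeholders):

```python
import torch
import torch.nn.functional as F

def range_penalised_loss(logits, targets, u=-5.0, v=5.0, lambda1=0.1, lambda2=0.1):
    # u/v: allowed min/max logit; lambda1/lambda2: penalty weights (placeholder values)
    ce = F.cross_entropy(logits, targets)
    m, n = logits.min(), logits.max()  # current min and max logits
    penalty = lambda1 * torch.clamp(u - m, min=0) + lambda2 * torch.clamp(n - v, min=0)
    return ce + penalty
```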

5

emilrocks888 t1_j6mjf7m wrote

I would scale the logits before the softmax, like it's done in self-attention. That scaling in self-attention is actually there to make the final distribution of the attention weights smooth.
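Roughly what that looks like (the sqrt scaling mirrors the 1/sqrt(d_k) factor in attention; any fixed divisor greater than 1 smooths the output):

```python
import math
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)                # (batch, num_classes), dummy values
scale = math.sqrt(logits.size(-1))         # analogous to sqrt(d_k) in attention
probs = F.softmax(logits / scale, dim=-1)  # flatter than F.softmax(logits, dim=-1)
```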

1

chatterbox272 t1_j6myph4 wrote

If the goal is to keep all predictions above a floor, the easiest way is to make the activation into floor + (1 - floor * num_logits) * softmax(logits). This doesn't have any material impact on the model, but it imposes a floor.

If the goal is to actually change something about how the predictions are made, then adding a floor isn't the solution. You could modify the activation function in some other way (e.g. by scaling the logits, normalising them, etc.), or you could impose a loss penalty on the difference between the logits or on the final predictions.
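A sketch of the floor activation from the first paragraph (the floor value is arbitrary):

```python
import torch
import torch.nn.functional as F

def floored_softmax(logits, floor=0.01):
    # floor + (1 - floor * num_logits) * softmax(logits); needs floor * num_logits < 1
    k = logits.size(-1)
    return floor + (1 - floor * k) * F.softmax(logits, dim=-1)

# Outputs still sum to 1 and never drop below the floor.
out = floored_softmax(torch.randn(2, 5), floor=0.1)
assert torch.allclose(out.sum(dim=-1), torch.ones(2))
assert (out >= 0.1 - 1e-6).all()
```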

1

nutpeabutter t1_j6n2eaf wrote

Taking a leaf out of RL, you can add an additional entropy loss.

Alternatively, clip the logits but apply a straight-through estimator (STE, i.e. copy the gradients) in the backward pass.
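Both ideas sketched in PyTorch (the entropy weight and clip range are placeholders):

```python
import torch
import torch.nn.functional as F

def entropy_bonus_loss(logits, targets, beta=0.01):
    # Cross-entropy minus a weighted entropy bonus, as in many RL policy-gradient losses.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return F.cross_entropy(logits, targets) - beta * entropy

def ste_clip(logits, lo=-5.0, hi=5.0):
    # Forward pass uses the clipped logits; gradients flow to the raw logits unchanged.
    clipped = logits.clamp(lo, hi)
    return logits + (clipped - logits).detach()
```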

1

chatterbox272 t1_j6n3vx6 wrote

My proposed function does that. Say you have two outputs and don't want either to go below 0.25. The floors already add up to 0.5, so you rescale the softmax to add up to the remaining 0.5, giving a total of 1 and a valid distribution.

2

No_Cryptographer9806 t1_j6nfqhq wrote

I'm curious why you want to do that. You can always post-process the logits, but forcing the network to learn it will harm the underlying representation imo.

1