nutpeabutter t1_j6n2eaf wrote

Taking a leaf out of RL's book, you can add an entropy term to the loss, so that overly peaked (low-entropy) output distributions are penalized.
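To make the entropy idea concrete, here is a rough framework-free sketch. The helper names and the weight `beta` are my own placeholders, not from any particular library, and I'm assuming the usual RL convention of subtracting a scaled entropy from the task loss so that higher entropy lowers the total.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs, eps=1e-12):
    # Shannon entropy in nats; eps guards against log(0).
    return -sum(p * math.log(p + eps) for p in probs)

def loss_with_entropy_bonus(task_loss, logits, beta=0.01):
    # beta is a hypothetical tuning knob: larger values push harder
    # toward flat (high-entropy) distributions.
    return task_loss - beta * entropy(softmax(logits))

peaked = [10.0, 0.0, 0.0]  # nearly one-hot after softmax
flat = [0.0, 0.0, 0.0]     # uniform after softmax

# With the same task loss, the flat distribution gets the lower total loss.
print(loss_with_entropy_bonus(1.0, peaked))
print(loss_with_entropy_bonus(1.0, flat))
```

In practice `beta` usually needs tuning: too large and the model never commits to a prediction, too small and the term has no effect.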

Alternatively, clip the logits in the forward pass but apply a straight-through estimator (STE) on backprop, i.e. copy the gradients through the clip unchanged.
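A rough sketch of what the STE variant means, with the forward and backward passes written out by hand. The function names and the clip range of [-5, 5] are placeholders I picked for illustration. The point is only the contrast: the exact gradient of a clip is zero wherever the input was clipped, while the straight-through version passes the upstream gradient along untouched.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def clip_forward(logits, lo=-5.0, hi=5.0):
    # Forward pass: logits outside [lo, hi] are saturated to the bounds.
    return [clip(z, lo, hi) for z in logits]

def clip_backward_exact(logits, grad_out, lo=-5.0, hi=5.0):
    # True derivative of clip: gradient is zeroed wherever clipping happened.
    return [g if lo <= z <= hi else 0.0 for z, g in zip(logits, grad_out)]

def clip_backward_ste(logits, grad_out, lo=-5.0, hi=5.0):
    # Straight-through: pretend clip was the identity and copy gradients.
    return list(grad_out)

logits = [8.0, -1.0, -9.0]
upstream = [0.1, 0.2, 0.3]  # made-up gradient arriving from the loss

print(clip_forward(logits))                   # [5.0, -1.0, -5.0]
print(clip_backward_exact(logits, upstream))  # [0.1 and 0.3 are dropped]
print(clip_backward_ste(logits, upstream))    # all three survive
```

In an autodiff framework this is commonly written as `x + stop_gradient(clip(x) - x)`; in PyTorch, for example, `x + (x.clamp(lo, hi) - x).detach()`. That way clipped logits still receive a learning signal instead of getting stuck at the bound.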
