nutpeabutter t1_j6n2eaf wrote on January 31, 2023 at 2:32 PM

Taking a leaf out of RL, you can add an additional entropy loss.
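
A rough sketch of what I mean by the entropy term, in plain Python so it runs without a framework. The function names and the `beta` coefficient are made up for illustration, and I'm assuming the usual RL convention: subtract the entropy of the softmax distribution from the loss, so that minimizing the loss pushes against overly peaked outputs.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(task_loss, logits, beta=0.01):
    # beta is a hypothetical coefficient; it has to be tuned per problem.
    # Higher entropy lowers the total loss, which discourages collapse
    # onto a single output.
    return task_loss - beta * entropy(softmax(logits))
```

Uniform logits give the maximum entropy, `log(n)` for `n` classes, so they get the largest reduction in loss; a sharply peaked distribution gets almost none.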

Alternatively, clip the logits in the forward pass but apply a straight-through estimator (STE) on backprop, i.e. copy the gradients through the clip unchanged.
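
To make the difference concrete, here is a minimal hand-written sketch of the forward and backward passes, in plain Python with no autodiff. The function names and the clip limit are hypothetical. The exact gradient of a clip is zero wherever the logit saturated, while the straight-through version passes the upstream gradient along as if the clip were the identity.

```python
def clip(x, limit):
    # Clamp a single value into [-limit, limit].
    return max(-limit, min(limit, x))

def clip_forward(logits, limit):
    # Forward pass: both variants produce the same clipped logits.
    return [clip(z, limit) for z in logits]

def clip_backward_exact(logits, grad_out, limit):
    # True derivative of clip: 1 inside the range, 0 where saturated,
    # so saturated logits receive no learning signal.
    return [g if -limit <= z <= limit else 0.0
            for z, g in zip(logits, grad_out)]

def clip_backward_ste(logits, grad_out, limit):
    # Straight-through estimator: treat the clip as the identity on the
    # backward pass and copy the upstream gradient unchanged.
    return list(grad_out)
```

In an autodiff framework the same effect is usually written as `x + stop_gradient(clip(x) - x)`; in PyTorch that would be something like `x + (x.clamp(-c, c) - x).detach()`.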