emilrocks888 t1_j6mjf7m wrote
Reply to Best practice for capping a softmax by neuralbeans
I would scale the logits before the softmax, the way it's done in self-attention. In fact, that scaling in self-attention exists precisely to keep the final distribution of the attention weights smooth (i.e., to stop it from saturating).
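A minimal sketch of the idea, assuming a simple temperature-style divisor (the `temperature` name and the example values are illustrative, not from the original post):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.5])

# Plain softmax: sharply peaked on the largest logit (~[0.93, 0.05, 0.03])
print(softmax(logits))

# Scaled logits: dividing by a constant flattens the distribution,
# effectively capping how close the max probability can get to 1
# (~[0.53, 0.25, 0.22] here)
temperature = 4.0
print(softmax(logits / temperature))
```

In scaled dot-product attention the same trick appears as softmax(QK^T / sqrt(d_k)), where the sqrt(d_k) divisor keeps the attention weights from collapsing onto a single position.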
emilrocks888 t1_j6mjnk7 wrote
Reply to comment by neuralbeans in Best practice for capping a softmax by neuralbeans
Sorry, autocorrect issue. I meant self-attention (I've edited my previous answer).