Submitted by neuralbeans t3_10puvih in deeplearning
emilrocks888 t1_j6mjf7m wrote
I would scale the logits before the softmax, like it's done in self-attention. Actually, that scaling in self-attention is there to keep the final distribution of the attention weights smooth.
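For reference, a minimal sketch of the kind of scaling being described here, assuming the standard 1/sqrt(d_k) factor from scaled dot-product attention (the tensor shapes and PyTorch usage are illustrative, not from the thread):

```python
import torch
import torch.nn.functional as F

# Illustrative example: scaling logits before softmax, as in scaled dot-product attention.
d_k = 64                          # assumed key dimension
q = torch.randn(1, 8, d_k)        # (batch, num_queries, d_k)
k = torch.randn(1, 8, d_k)        # (batch, num_keys, d_k)

logits = q @ k.transpose(-2, -1)      # raw attention logits, shape (1, 8, 8)
scaled = logits / d_k ** 0.5          # scaling keeps logits small so softmax doesn't saturate
weights = F.softmax(scaled, dim=-1)   # smoother distribution than softmax(logits)
```

Without the division, the logits grow with d_k, the softmax saturates toward a one-hot distribution, and gradients become tiny; dividing by sqrt(d_k) keeps the attention weights smoother.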
neuralbeans OP t1_j6mjhog wrote
What's this about del attention?
emilrocks888 t1_j6mjnk7 wrote
Sorry, autocorrect/dictionary issue. I meant self-attention (I've edited the previous answer).