
emilrocks888 t1_j6mjf7m wrote

I would scale the logits before the softmax, like it's done in self-attention. That scaling in self-attention is there to keep the final distribution of attention weights smooth rather than overly peaked.
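A minimal sketch of the effect (NumPy, with made-up shapes and the 1/sqrt(d_k) scale factor used in scaled dot-product attention; not anyone's actual code):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits with magnitudes on the order of sqrt(d_k).
d_k = 64
rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 10)) * np.sqrt(d_k)

sharp = softmax(logits)                  # unscaled: tends to saturate toward one-hot
smooth = softmax(logits / np.sqrt(d_k))  # scaled: flatter, smoother distribution

print(sharp.max(axis=-1))   # close to 1.0
print(smooth.max(axis=-1))  # noticeably lower
```

Dividing by sqrt(d_k) keeps the logits in a range where the softmax doesn't collapse onto a single entry.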

1

neuralbeans OP t1_j6mjhog wrote

What's this about del attention?

1

emilrocks888 t1_j6mjnk7 wrote

Sorry, autocorrect issue. I meant self-attention (I've edited the previous answer).

1