
emilrocks888 t1_j6mjf7m wrote

I would scale the logits before the softmax, like it's done in self-attention. That scaling in self-attention is there to keep the final distribution of attention weights smooth rather than overly peaked.
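A minimal sketch of the effect (NumPy, with made-up shapes and the 1/sqrt(d_k) scale factor used in scaled dot-product attention; not anyone's actual code):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits with magnitudes on the order of sqrt(d_k).
d_k = 64
rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 10)) * np.sqrt(d_k)

sharp = softmax(logits)                  # unscaled: tends to saturate toward one-hot
smooth = softmax(logits / np.sqrt(d_k))  # scaled: flatter, smoother distribution

print(sharp.max(axis=-1))   # close to 1.0
print(smooth.max(axis=-1))  # noticeably lower
```

Dividing by sqrt(d_k) keeps the logits in a range where the softmax doesn't collapse onto a single entry.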

1

neuralbeans OP t1_j6mjhog wrote

What's this about del attention?

1

emilrocks888 t1_j6mjnk7 wrote

Sorry, autocorrect issue. I meant self-attention (I've edited the previous answer).

1