[D] Why do Attention layers work so well? Don't weights in DNNs already tell the network how much weight/attention to give to a specific input? (High weight = lots of attention, low weight = little attention)
Submitted by 029187 on October 2, 2022 in MachineLearning
Nameless1995 wrote on October 3, 2022:
DNN weights are static (the same for all inputs), whereas attention weights are dynamic (input-dependent). In this sense, attention weights are a kind of "fast weights".
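A minimal sketch of that contrast, using single-head scaled dot-product self-attention (names like `w_q`, `w_k`, and `attention_weights` are illustrative, not from any particular library):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 4  # feature dimension

# Static weights: a linear layer's weight matrix is fixed after
# training and identical for every input it ever sees.
linear = torch.nn.Linear(d, d)
print(linear.weight)  # does not depend on the input x

# Dynamic ("fast") weights: self-attention recomputes its mixing
# weights from the input itself on every forward pass.
def attention_weights(x, w_q, w_k):
    q = x @ w_q                        # queries, shape (seq_len, d)
    k = x @ w_k                        # keys,    shape (seq_len, d)
    scores = q @ k.T / d ** 0.5        # scaled dot-product scores
    return F.softmax(scores, dim=-1)   # each row sums to 1

# The learned projections w_q and w_k are themselves static...
w_q = torch.randn(d, d)
w_k = torch.randn(d, d)

x1 = torch.randn(3, d)  # one input sequence (3 tokens)
x2 = torch.randn(3, d)  # a different input sequence

# ...but the resulting attention matrices differ per input:
print(attention_weights(x1, w_q, w_k))
print(attention_weights(x2, w_q, w_k))
```

The point of the sketch: the only learned parameters (`w_q`, `w_k`) are static, yet the effective weights used to mix the tokens are computed fresh from each input, which is what a plain weight matrix cannot do.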