Submitted by 029187 t3_xtzmi2 in MachineLearning
So an attention layer has Q, K, and V vectors. My understanding is that the goal is to say, for a given query q, how relevant each value v is (scored via its key k).
From this the network learns which data is relevant to focus on for a given input.
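For concreteness, here is a minimal sketch of that mechanism (scaled dot-product attention) in plain NumPy; single head, no batching, and the variable names are just illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: weight each value v by how well
    its key k matches the query q (softmax of scaled dot products)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_queries, n_keys) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted sum of values

# toy usage: 4 tokens, model dim 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8)
```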
But what I don't get is why this is effective. Don't DNNs already do this with their weights? A neuron in a hidden layer can be set off by any arbitrary combination of inputs, so in principle something like attention should be able to emerge naturally inside a DNN. For example, an image recognition network may learn to focus on specific patterns of pixels and ignore others.
Why does hard-coding this mechanism into the model provide so much benefit?
hellrail t1_iqsvjt7 wrote
Transformers are graph networks applied to graph data; CNNs do not operate on graph data.
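One way to read that claim (my illustration, not the commenter's code): a self-attention layer can be viewed as one round of message passing on a fully connected graph over the tokens, where each node aggregates "messages" (values) from every node, weighted by query-key similarity. A sketch under that assumption:

```python
import numpy as np

def message_passing_attention(nodes, Wq, Wk, Wv):
    """Self-attention written as explicit message passing on a complete
    graph: every token is a node, and every node aggregates messages
    (values) from all nodes, weighted by query-key similarity."""
    n, _ = nodes.shape
    Q, K, V = nodes @ Wq, nodes @ Wk, nodes @ Wv
    out = np.zeros_like(V)
    for i in range(n):                                   # for each node / token i ...
        logits = np.array([Q[i] @ K[j] for j in range(n)]) / np.sqrt(Q.shape[-1])
        w = np.exp(logits - logits.max())
        w /= w.sum()                                     # edge weights from node i to every node j
        out[i] = sum(w[j] * V[j] for j in range(n))      # aggregate incoming messages
    return out
```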