AristocraticOctopus t1_iqv8dyh wrote

I understand what you're confused about, I had the same confusion initially. Here's how I think about it:

Say we have 3 input variables (x1,x2,x3) and a fully connected network (FCN).

The first hidden layer is a fixed function of the inputs:

h = f(x1,x2,x3)

no matter what the values of x1, x2, x3 are.
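To see what "fixed" means here, a tiny NumPy sketch (my own illustration; the shapes, values, and the missing nonlinearity are just for brevity): the same weight matrix mixes (x1, x2, x3) for every input.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # learned during training, then fixed

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([3.0, 2.0, 1.0])   # a different input

# The same W mixes (x1, x2, x3) for both inputs; only the values flowing
# through change, never how much each index contributes to each output unit.
h  = W @ x
hp = W @ xp
```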

Let's say the first dimension of h, h1, models some semantic feature A, which is always a function of x1 and (something else). Let's also say that for this batch of inputs there's some important relationship between x1 and x3, and x2 really doesn't matter. The network will update towards decreasing the magnitude of the x2-multiplying component of the weight matrix, w2.

Now for our next set of inputs (x'1,x'2,x'3), the important relationship for determining semantic feature A is between x'1 and x'2 - uh oh! If we just run this through the network again the same way, we'll decrease the magnitude on w3 and increase it on w1 and w2, which partly negates our previous update. The problem is that the transformation depends on the index in a fixed way and on the data only through its values (magnitudes) - but we would like the index dependence itself to vary with the data.
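To make the conflicting updates concrete, here's a toy sketch (my own construction, not something from the thread): a 3-weight linear model, one batch whose target depends on x1 and x3 and one whose target depends on x1 and x2. The mean-squared-error gradients on w2 point in opposite directions, so the two batches fight over the same fixed weight.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
X = rng.normal(size=(N, 3))          # columns are x1, x2, x3
w = np.array([1.0, 0.5, 0.5])        # current weights of the linear "layer"

def grad(X, y, w):
    # gradient of mean squared error with respect to w
    return 2 / len(X) * X.T @ (X @ w - y)

y_a = X[:, 0] + X[:, 2]              # batch A: x1 and x3 matter, x2 doesn't
y_b = X[:, 0] + X[:, 1]              # batch B: x1 and x2 matter, x3 doesn't

g_a = grad(X, y_a, w)
g_b = grad(X, y_b, w)
print(g_a[1])   # positive: the step on batch A shrinks w2
print(g_b[1])   # negative: the step on batch B grows w2 back
```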

What we need is to make the transformation itself data-dependent. It's a bit like lifting a function to a functional, if that analogy makes any sense (someone jump in if that's a horrible analogy). I believe this is the "fast weight" perspective on Transformers - attention doesn't apply a fixed weight matrix to the input features; it programs "virtual weights" (built from Q, K, V) that carry index-data dependence, and those virtual weights are then applied in the usual way to produce the output features.
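Here's a minimal single-head self-attention sketch in NumPy (no masking, no multi-head, names are mine) that shows the data-dependence explicitly: the mixing matrix A is recomputed from the input itself, so two different inputs are transformed by two different effective weight matrices, whereas an FCN would apply the same W to both.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # feature dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    """Single-head self-attention; X has shape (tokens, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)    # row-wise softmax over keys
    return A @ V, A                          # A acts as a data-dependent weight matrix over V

X1 = rng.normal(size=(3, d))
X2 = rng.normal(size=(3, d))
_, A1 = attention(X1)
_, A2 = attention(X2)
# A1 != A2: the transformation changes with the input, while the learned
# projections Wq, Wk, Wv stay fixed - they "program" the virtual weights A.
```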

It's true that FCNs can represent any fixed transformation with their learned weights - but we don't want a fixed transformation; we want a transformation that changes with the data/index interdependence, and that's what attention gets us.
