Submitted by 029187 t3_xtzmi2 in MachineLearning

So an attention layer has Q, K, and V vectors. My understanding is that the goal is to say, for a given query q, how relevant each key k is, and then use that relevance to weight the corresponding value v.
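For concreteness, here is roughly the mechanism I mean - a minimal single-head scaled dot-product attention sketch in PyTorch, with the shapes and projection matrices made up purely for illustration:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices of queries, keys, and values.
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # (seq_len, seq_len) query-key relevance
    weights = F.softmax(scores, dim=-1)          # each query's weighting over all positions
    return weights @ V                           # values mixed according to that relevance

# Toy usage: 5 tokens with 8-dimensional features (all numbers made up).
x = torch.randn(5, 8)
Wq, Wk, Wv = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)  # learned in a real model
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # torch.Size([5, 8])
```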

From this the network learns which data is relevant to focus on for a given input.

But what I don't get is why this is effective. Don't DNNs already do this with weights? A neuron in a hidden layer can be set off by any arbitrary combination of inputs, so in principle something like attention should be able to emerge naturally inside a DNN. For example, an image recognition network may learn to focus on specific patterns of pixels and ignore others.

Why does hard-coding this mechanism into the model provide so much benefit?

45

Comments


hellrail t1_iqsvjt7 wrote

Transformers are graph networks applied to graph-structured data; CNNs do not operate on graph data.

0

suflaj t1_iqsw33w wrote

The self-attention mechanism evaluates relationships within the input itself and outputs those relationships. DNN layers evaluate the relationship between the input and the weights of the layer; they just output the input transformed into another hyperspace.
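A rough sketch of that contrast in PyTorch (shapes and names are made up for illustration; the learned projections are omitted from the attention part for brevity):

```python
import torch
import torch.nn.functional as F

x = torch.randn(5, 8)      # 5 input tokens, 8 features each (made-up shapes)
W = torch.randn(8, 8)      # a learned, fixed weight matrix

# FC/DNN layer: each output row is a function of one input row and the weights;
# no token "sees" any other token.
fc_out = torch.relu(x @ W)                 # (5, 8)

# Self-attention: the output for each token is built from similarity scores
# between that token and every other token.
scores = x @ x.T / x.shape[-1] ** 0.5      # (5, 5) input-input relationships
attn_out = F.softmax(scores, dim=-1) @ x   # (5, 8)
```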

30

029187 OP t1_iqsyxb4 wrote

The relationships between inputs are just mathematical functions. In principle, the DNN could also arrive at those functions. For example, if we imagine a dense network with >1 hidden layer, the first hidden layer is just looking at the inputs and their weights, but the subsequent layers are looking at combinations of the inputs and their weights, which could in principle be used to identify relationships. The more layers and nodes, the more complex the relationships. As DNNs are universal approximators, this must be true. (Although clearly, just because something can be approximated in theory doesn't mean the DNN will actually converge to it via backprop.)

Clearly, though, in a lot of use cases the attention network converges faster and more accurately.

Has there been a lot of research on what in particular allows the attention layers to achieve this?

8

029187 OP t1_iqsz9t7 wrote

Yeah, I get why the non-locality is useful: CNNs group data locally, which doesn't make sense for graph data (the relevant word could be very far away in the sentence).

But a densely connected deep neural network should already have what it needs to map out any arbitrary function relating the nodes of a graph.

4

suflaj t1_iqt1v5b wrote

> In principle, the DNN could also arrive at those functions.

With a very sparse, deep FC network, sure. But in practice it will not happen to anything like the extent it does in self-attention. In practice it is hard to even reduce the number of transformer blocks and imitate them at certain checkpoints, let alone emulate self-attention with any number of FC layers in series.

You are completely disregarding the fact that just because it is possible to define a mathematical approximation, it doesn't mean there is an algorithm which can consistently lead the weights to it. Certainly for self-attention, the loss landscape the optimization algorithms traverse is not really well-behaved.

Theory mostly doesn't matter for deep learning because the subject is not even explainable by theory for the most part. It's a bunch of trial and error, especially with transformers. There are many parts of DL which contradict ML theory when applied in practice. Unless you have a proof that guarantees something, I'd recommend avoiding trying to apply theory without practice; it's probably a waste of time. There are many theoretical claims, like the universal approximation theorem, but they do not really hold up in practice for the resources we have.

12

029187 OP t1_iqt44bx wrote

>You are completely disregarding the fact that just because it is possible to define a mathematical approximation, it doesn't mean there is an algorithm which can consistently lead the weights to it. Certainly for self-attention, the loss landscape the optimization algorithms traverse is not really well-behaved.

Yeah I was more just trying to understand if there is a theoretical understanding of why the weights are not led to it via backprop. I 100% agree with your point though. Just because something CAN approximate doesn't mean there is an optimization algorithm that will lead to that approximation. If that were the case, every universal approximator would be as good as every other, which is clearly not the case.


>Theory mostly doesn't matter for deep learning because the subject is not even explainable by theory for the most part.

This is an interesting take. Can you elaborate a bit?

5

suflaj t1_iqt5hll wrote

>if there is a theoretical understanding of why the weights are not led to it via backprop

Not really. The intuition is that self-attention is a vastly different kernel from anything FC layers can handle easily, especially the dot product, which I assume is the main culprit.

>This is an interesting take. Can you elaborate a bit?

I'm not sure how I could elaborate on this. If you read papers you will see that most of the stuff in DL has barely any theoretical basis. On the topic of transformers, about the most theoretical part is the normalization of the self-attention scores (dividing by the square root of d_k). Everything else in the original paper is mostly shooting in the dark. It's even more of a joke when you realize they didn't even try different seeds, so they never noticed that the one used in the original paper gave them fairly bad results.
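(As an aside, that sqrt(d_k) factor at least has a simple statistical motivation; a quick sanity check, assuming roughly unit-variance query/key entries:)

```python
import torch

d_k = 512
q = torch.randn(10000, d_k)   # stand-in queries/keys with unit-variance entries
k = torch.randn(10000, d_k)

raw = (q * k).sum(dim=-1)     # dot products: variance grows linearly with d_k
scaled = raw / d_k ** 0.5     # dividing by sqrt(d_k) restores roughly unit variance

print(raw.std())              # ~ sqrt(512) ~ 22.6
print(scaled.std())           # ~ 1.0, which keeps the softmax out of its saturated region
```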

You can also look at all the different transformer architectures that can't seem to converge on anything, since the foundations for them are so weak and unscientific, I'd dare say arbitrary. And then, just as you think you might find more hope in CNNs, which aren't so arbitrary, you're met with a slightly different residual block in ConvNeXt that supposedly gives you results comparable to vision transformers, yet with barely any theoretical basis behind it, mostly intuition.

9

029187 OP t1_iqt73b1 wrote

>Not really. The intuition is that self-attention is a vastly different kernel from anything FC layers can handle easily, especially the dot product, which I assume is the main culprit.

Interesting! That's good to know. I wonder with different optimizers if it will be possible in the future.


>If you read papers you will see that most of the stuff in DL has barely any theoretical basis. On the topic of transformers, about the most theoretical part is the normalization of the self-attention scores (dividing by the square root of d_k). Everything else in the original paper is mostly shooting in the dark. It's even more of a joke when you realize they didn't even try different seeds, so they never noticed that the one used in the original paper gave them fairly bad results.
>
>You can also look at all the different transformer architectures that can't seem to converge on anything, since the foundations for them are so weak and unscientific, I'd dare say arbitrary. And then, just as you think you might find more hope in CNNs, which aren't so arbitrary, you're met with a slightly different residual block in ConvNeXt that supposedly gives you results comparable to vision transformers, yet with barely any theoretical basis behind it, mostly intuition.

This was actually a very good elaboration, thank you. Keep in mind, to you this was probably obvious, but to other folks like me this is very insightful.

9

suflaj t1_iqt971l wrote

Ah sorry, based on your responses I was convinced you were reading papers, so my response might have been overly aggressive due to the incredibly negative experience I have had while reading relevant DL papers. It truly feels like the only difference between SOTA and a garbage paper is that the SOTA one somehow got to work on a specific machine, specific setup and specific training run. And this spills over into the whole of DL.

Hopefully you will not have the misfortune of trying to replicate some of the papers that either don't have a repo linked or aren't maintained by a large corporation; if you do, you will understand better what I meant.

12

029187 OP t1_iqtcofx wrote

>Ah sorry, based on your responses I was convinced you were reading papers, so my response might have been overly aggressive due to the incredibly negative experience I have had while reading relevant DL papers.

It's all good. I'm happy to hear your thoughts.

I've read some papers but I'm by no means an expert. Ironically I've actually used ML in a professional setting, but most of my work is very much "let's run some models and use the most accurate one". Generally squeezing an extra percent via SOTA models is not worth it, so I don't deal with them much.

I do try to keep up to date with latest models, but it all seems so trial-and-error, which I think is what you were getting at.

In addition, there is a lot of incorrect theory out there which makes it even harder for amateurs or semi-pros like me. I still see videos on YouTube to this day claiming DNNs are effective because they are universal approximators, which is clearly not the reason, since there are tons of universal approximator models besides DNNs that cannot be trained as effectively on problem sets like image recognition or NLP. Universal Approximation is likely necessary but almost certainly not sufficient.

I've been reading papers like the lottery ticket hypothesis, which seem like they're trying to give some insight into why DNNs are a useful architecture, as well as Google's follow-up paper about rigging the lottery.

Those papers have gotten me pretty interested in reading up on why these models work so well, but it seems that when you look into it the results are as you've said: a lot of trial and error without much of a theoretical underpinning. Of course, I'm no expert, so I don't want to poo-poo the work that a lot of very smart and experienced folks are doing.

5

pia322 t1_iqtfrt3 wrote

I really like this question. I agree with you that a NN is an arbitrary function approximator, and it could easily implicitly learn the attention function.

I personally embrace the empiricism. We try to make theoretical justifications, but in reality, attention/transformers just happen to work better, and no one really knows why. One could argue that 95% of deep learning research follows this empirical methodology, and the "theory" is an afterthought to make the papers sound nicer.

Why is ResNet better than VGG? Or ViT better than ResNet? They're all arbitrary function approximators, so they should all be able to perform identically well. But empirically, that's not the case.

5

HjalmarLucius t1_iqtl4yv wrote

Attention introduces multiplicative relationships between inputs, i.e. terms like x*y, whereas ordinary layers only combine inputs additively.
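A tiny PyTorch illustration of that difference (purely a sketch with made-up shapes):

```python
import torch

x = torch.randn(4, 3)   # 4 tokens, 3 features (made-up shapes)
W = torch.randn(3, 3)

# A linear layer is additive in its inputs: doubling x exactly doubles the output.
assert torch.allclose((2 * x) @ W, 2 * (x @ W))

# Attention scores contain products of pairs of inputs, so doubling x
# quadruples the (pre-softmax) scores - the x*y style of interaction.
assert torch.allclose((2 * x) @ (2 * x).T, 4 * (x @ x.T))
```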

9

TheJulianInside t1_iquno3u wrote

I like the geometric interpretation. It's not exactly a theory that explains it all, but I find it a very useful framework for thinking about these types of questions.

The main way I understand it is in terms of the space of possible functions a model has to consider. It's also closely related to the "curse of dimensionality": given that data is finite and cannot fill the space in all dimensions, a generic universal function approximator will never see a dense enough sampling of the input space to learn useful representations. So geometric priors are necessary to reduce the space of functions.

I'm fascinated by the work of Michael Bronstein and friends.

Note: I'm very far from an expert on any of this

2

Nameless1995 t1_iqus5e6 wrote

DNN weights are static (the same for all inputs). Attention weights are dynamic (input-dependent). In this sense, attention weights are a sort of "fast weights".
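A minimal sketch of that distinction (made-up shapes, learned projections omitted):

```python
import torch
import torch.nn.functional as F

W = torch.randn(8, 8)   # "slow" weights: fixed for every input once trained

def attn_matrix(x):
    # self-attention's mixing matrix is recomputed from the input itself
    return F.softmax(x @ x.T / x.shape[-1] ** 0.5, dim=-1)

x1, x2 = torch.randn(5, 8), torch.randn(5, 8)

# The FC layer transforms both inputs with the exact same W ...
y1, y2 = x1 @ W, x2 @ W
# ... while the mixing matrix the attention layer applies differs per input.
print(torch.allclose(attn_matrix(x1), attn_matrix(x2)))  # False
```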

5

graphicteadatasci t1_iqv0m4y wrote

This is the one. A DNN may be a universal function approximator, but only if the data and n_parameters are unbounded. With enough data we could learn y as parameters, and multiplying those parameters with x would give us x*y. But we don't have infinite data or infinite parameters, and even if we did, we don't have a stable method for training indefinitely. So we need other stuff.

3

AristocraticOctopus t1_iqv8dyh wrote

I understand what you're confused about, I had the same confusion initially. Here's how I think about it:

Say we have 3 input variables (x1,x2,x3) and a fully connected network (FCN).

The first hidden layer is a fixed function of the inputs:

h = f(x1,x2,x3)

no matter what the values of x1, x2, x3 are.

Let's say the first dimension of h, h1, models some semantic feature A, which is always a function of x1 and (something else). Let's also say that for this batch of inputs there's some important relationship between x1 and x3, and x2 really doesn't matter. The network will update towards decreasing the magnitude of the x2-multiplying component of the weight matrix, w2.

Now for our next set of inputs (x'1, x'2, x'3), the important relationship for determining semantic feature A is between x'1 and x'2 - uh oh! If we just run this through the network again the same way, we'll decrease the magnitude on w3 and increase it on w1 and w2, which sorta negates a bit of our previous update. The problem is that the transformation depends on the index in a fixed way and on the data in a variable way (particularly through magnitude) - but we would like the index dependence itself to vary based on the data.

What we need is to make the transformation itself data-dependent. It's a bit like lifting a function to a functional, if that analogy makes any sense (someone jump in if that's a horrible analogy). I believe this is the "fast weight" perspective on Transformers - the attention weights don't directly compute dot products on input features - they program "virtual weights" (i.e. Q,K,V) which have index-data dependence, and those "virtual weights" get transformed in the usual way to produce the output features.

It's true that FCNs can represent any fixed transformation with their learned weights - but we don't want fixed transformations, we want a different transformation depending on the data/index interdependence, and that's what attention gets us.
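A toy version of that scenario (hand-picked vectors standing in for the tokens, purely illustrative):

```python
import torch
import torch.nn.functional as F

def attn_weights(x):
    # which tokens each token attends to, recomputed per input
    return F.softmax(x @ x.T / x.shape[-1] ** 0.5, dim=-1)

# Input A: the first token is most similar to the third (the second is unrelated noise).
a = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 1.0]])

# Input B: same shape, but now the first token is most similar to the second.
b = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 0.0]])

print(attn_weights(a)[0])  # first token's mass goes to itself and the third token
print(attn_weights(b)[0])  # first token's mass shifts to itself and the second token

# A fixed FC layer would mix positions with the same learned coefficients for
# both a and b, regardless of which tokens actually match.
```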

13

firejak308 t1_iqvqnii wrote

Thanks for this explanation! I've heard the general reasoning that "transformers have variable weights" before, but I didn't quite understand the significance of that until you provided the concrete example of relationships between x1 and x3 in one input, versus x1 and x2 in another input.

2

recordertape t1_iqvs43g wrote

While I have no idea about the low-level impact of attention, I'd like to note that (at least for image recognition) a lot of (transformer) architecture papers achieve SOTA results/improvements due to superior data augmentation and losses rather than architectural advances (although they might write a whole paper about the architecture and only mention the training tricks in the supplementary material...). Many transformer papers use quite complicated setups with RandAug/CutMix augmentations, distillation losses, EMA weights, etc. "ResNet strikes back" shows that ResNet accuracy is significantly boosted by using similar training pipelines, and ConvNeXt achieves results similar to transformers. While I'm not an expert in DNN architectures, I'd guess some hybrid with interleaved conv/transformer layers could be optimal, with the conv layers extracting local features and the transformer layers handling long-range relationships. Probably something like MobileViT. But if I had to pick something for production/prototyping now, it'd just be a ResNet: well-supported and optimized in libraries/hardware, and forgiving to train.

2

Desperate-Whereas50 t1_iqwzlgc wrote

I am not a transformer expert, so maybe this is a stupid question, but is this also true for transformer-based architectures? For example, BERT uses 12/24 transformer blocks. That doesn't sound as deep as, for example, a ResNet-256.

1