
neuralbeans t1_j7f68rv wrote

Parameters are only a small fraction of the values held in GPU memory. The number of attention activations grows quadratically with sequence length.
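As a rough sketch of that quadratic growth (assuming fp32 and a single attention head; the exact numbers depend on the model):

```python
# Memory for one (seq_len x seq_len) attention score matrix, fp32 assumed.
def attn_scores_bytes(seq_len: int, dtype_bytes: int = 4) -> int:
    return seq_len * seq_len * dtype_bytes

for seq_len in (512, 1024, 2048, 6250):
    print(f"seq_len={seq_len:>5}: {attn_scores_bytes(seq_len) / 2**20:8.1f} MiB")
# Doubling the sequence length quadruples this term.
```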


beautyofdeduction OP t1_j7hkb7q wrote

Yes, that's true. But even adding that in (6250 * 6250 ≈ 40 million floats), we are still nowhere near 40 GB.
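Spelling out that arithmetic (fp32 assumed):

```python
# The single-matrix estimate above, assuming fp32 (4 bytes per float).
floats = 6250 * 6250                 # 39,062,500 values
print(floats * 4 / 2**30)            # ~0.15 GiB -- tiny compared to 40 GB
```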


neuralbeans t1_j7jdiqz wrote

A sequence length of 6250 is massive! It's not just 6250 * 6250, since you're not storing one float per pair of sequence items. You're multiplying the query and key vectors together for every pair of sequence items, and this is done for every attention head (in parallel). I think you're seriously underestimating the problem.
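A back-of-the-envelope sketch of where that leads, assuming a base-Transformer-like setup (6 layers, 8 heads, batch size 1, fp32) rather than anything specific from your code:

```python
# Attention score memory across heads and layers for seq_len = 6250.
# Assumed config (not from this thread): 6 layers, 8 heads, batch 1, fp32.
seq_len, layers, heads, batch, dtype_bytes = 6250, 6, 8, 1, 4

# One (seq_len x seq_len) score matrix per head, per layer, per batch element;
# the softmax output is typically kept for the backward pass too (factor 2).
scores = batch * layers * heads * seq_len * seq_len * dtype_bytes
print(f"scores only:     {scores / 2**30:.1f} GiB")       # ~7.0 GiB
print(f"+ saved softmax: {2 * scores / 2**30:.1f} GiB")   # ~14.0 GiB
# Batch size scales this linearly, and it excludes Q/K/V projections, MLP
# activations, gradients, and optimizer state.
```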

What transformer is this that accepts a sequence length of 6250?


beautyofdeduction OP t1_j7jqohn wrote

I wish I could send you my GitHub. But the original Attention Is All You Need paper trained on sequences of length 25000 on multiple K80s (as stated by the authors), which have only 12 GB of VRAM each. Yes, they used multiple GPUs, but AFAIK each GPU needs to be able to handle its own batch. Or maybe not? Again, I wish I could show you my code.
