Thunderbird120 t1_jakbyew wrote

You're better qualified to know than nearly anyone who posts here, but is flash attention really all that's necessary to make that feasible?

24

lucidraisin t1_jakdtf7 wrote

yes

edit: it was also used to train Llama. there is no reason not to use it at this point, for both training and fine-tuning / inference

46
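
For concreteness, a minimal sketch of what using it can look like, assuming PyTorch >= 2.0, whose built-in scaled_dot_product_attention can dispatch to a FlashAttention kernel on a recent CUDA GPU (illustrative only; not a claim about how LLaMA was actually trained):

```python
# Minimal sketch: request the FlashAttention backend of PyTorch's built-in
# scaled_dot_product_attention (requires a recent CUDA GPU and fp16/bf16).
import torch
import torch.nn.functional as F

B, H, N, D = 2, 8, 4096, 64  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restrict dispatch to the FlashAttention kernel; drop the context manager
# to let PyTorch pick whichever backend it considers fastest.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # B,H,N,D
```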

fmai t1_jalcs0x wrote

AFAIK, flash attention is just a very efficient implementation of attention, so it's still quadratic in the sequence length. Can this be a sustainable solution once context windows grow to hundreds of thousands of tokens?

14

lucidraisin t1_jamtx7b wrote

it cannot; the compute still scales quadratically, although the memory bottleneck is now gone. however, i see everyone training at 8k or even 16k within two years, which is more than plenty for previously inaccessible problems. for context lengths at the next order of magnitude (say, genomics at a million base pairs), we will have to see whether linear attention (rwkv) pans out, or whether recurrent + memory architectures make a comeback.

14
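
A toy illustration of that scaling point: standard attention materializes an n x n score matrix, so compute grows as O(n^2 * d), whereas a kernelized linear-attention reordering (in the Katharopoulos et al. sense; rwkv's recurrence is a different formulation) keeps the work at O(n * d^2). Sketch only:

```python
# Toy, single-head comparison of where the quadratic cost comes from.
import torch
import torch.nn.functional as F

n, d = 8192, 64
q, k, v = (torch.randn(n, d) for _ in range(3))

# Standard attention: (n x d) @ (d x n) -> an n x n score matrix.
scores = (q @ k.T) / d**0.5
out_quadratic = torch.softmax(scores, dim=-1) @ v

# Kernelized linear attention: with a feature map phi in place of softmax,
# compute the d x d summary first, so nothing of size n x n is ever formed.
def phi(x):
    return F.elu(x) + 1

kv_state = phi(k).T @ v                 # d x d
normalizer = phi(k).sum(dim=0)          # d
out_linear = (phi(q) @ kv_state) / (phi(q) @ normalizer).unsqueeze(-1)
```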

LetterRip t1_janljeo wrote

Ah, I'd not seen the Block Recurrent Transformers paper before, interesting.

3

visarga t1_jalg9iu wrote

I think the main pain point was memory usage.

6

Dekans t1_jamokhr wrote

> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.

...

> FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

In the paper, the Path-256 result (seq. length 64K) comes from the block-sparse version; the Path-X result (16K length) uses regular FlashAttention.

4
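
For intuition, a rough sketch of what a block-sparse attention pattern looks like. This applies the mask densely for clarity; the actual block-sparse FlashAttention kernel skips the masked blocks rather than materializing them, and the local-plus-first-block pattern below is an arbitrary choice for the example:

```python
# Toy block-sparse attention pattern, applied as a dense mask for clarity.
import torch

n, d, blk = 1024, 64, 128
nb = n // blk
q, k, v = (torch.randn(n, d) for _ in range(3))

# Keep block (i, j) if it is on or next to the diagonal (local attention)
# or if j == 0 (every query can also see the first block).
bi = torch.arange(nb)
block_mask = (bi[:, None] - bi[None, :]).abs() <= 1
block_mask[:, 0] = True

# Expand the nb x nb block mask to a full n x n token mask.
token_mask = block_mask.repeat_interleave(blk, 0).repeat_interleave(blk, 1)

scores = (q @ k.T) / d**0.5
scores = scores.masked_fill(~token_mask, float("-inf"))
out = torch.softmax(scores, dim=-1) @ v
```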

Hsemar t1_jalp8as wrote

but does flash attention help with auto-regressive generation? My understanding was that it prevents materializing the large QK^T attention matrix during training. At inference (one token at a time) with KV caching, this shouldn't be that relevant, right?

0
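
For reference, a toy sketch of the KV-cached decoding the question describes: each step's new query only forms a 1 x t row of scores against the cache, so no t x t matrix is materialized during generation itself; the quadratic part at inference is processing the prompt (prefill). Toy code, single head, hypothetical shapes:

```python
# Toy single-head decoding loop with a KV cache.
import torch

d = 64
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

def decode_step(q_new, k_new, v_new):
    """Append the new key/value and attend from the single new query."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new])        # t x d
    v_cache = torch.cat([v_cache, v_new])        # t x d
    scores = (q_new @ k_cache.T) / d**0.5        # 1 x t -- never t x t
    return torch.softmax(scores, dim=-1) @ v_cache

for _ in range(16):  # generate 16 tokens (projections omitted for brevity)
    q, k, v = (torch.randn(1, d) for _ in range(3))
    out = decode_step(q, k, v)
```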