
CellWithoutCulture t1_javhjpc wrote

I mean... why were they not doing this already? They would have to code it, but it seems like low-hanging fruit.

> memory efficient attention. 10x-20x increase in batch size.

That seems large, which paper has that?

1

LetterRip t1_javpxbv wrote

> I mean... why were they not doing this already? They would have to code it, but it seems like low-hanging fruit.

GPT-3 came out in 2020 (they had their initial price, then a modest price drop early on).

FlashAttention is from June of 2022.

Quantization is something we've only recently figured out how to do fairly losslessly (especially int4). Tim Dettmers' LLM.int8() is from August 2022.

https://arxiv.org/abs/2208.07339
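
Rough toy sketch (mine, not Dettmers' code) of the basic idea: plain absmax int8 quantization in PyTorch. LLM.int8() refines this with per-vector scaling and fp16 handling of outlier features, but the memory win is the same, 4 bytes per weight down to 1.

```python
import torch

def absmax_quantize(x: torch.Tensor):
    # Scale so the largest magnitude maps to 127, then round to int8.
    scale = 127.0 / x.abs().max()
    q = (x * scale).round().clamp(-128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximate fp32 tensor from the int8 values.
    return q.to(torch.float32) / scale

w = torch.randn(4096, 4096)        # a weight matrix in fp32 (64 MB)
q, scale = absmax_quantize(w)      # same shape, stored in int8 (16 MB)
err = (w - dequantize(q, scale)).abs().mean()
print(f"mean abs quantization error: {err.item():.5f}")
```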

> That seems large, which paper has that?

See

https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg

>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.

https://github.com/HazyResearch/flash-attention
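
Rough illustration (mine, not from the repo) of why the savings grow with sequence length: naive attention materializes an N x N score matrix per head, while a fused kernel streams over it. Here I use PyTorch's `F.scaled_dot_product_attention` (PyTorch 2.0+), which on GPU can dispatch to FlashAttention-style kernels that never build that matrix.

```python
import torch
import torch.nn.functional as F

B, H, N, D = 1, 8, 2048, 64          # batch, heads, sequence length, head dim
q = torch.randn(B, H, N, D)
k = torch.randn(B, H, N, D)
v = torch.randn(B, H, N, D)

# Naive attention: builds a (B, H, N, N) score tensor -> memory is O(N^2).
scores = (q @ k.transpose(-2, -1)) / D**0.5
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: same math, but memory grows only linearly in N
# when a FlashAttention / memory-efficient backend is available.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-4))
```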

1

CellWithoutCulture t1_javqw9s wrote

Fantastic reply, it's great to see all those concrete advances that made it into prod. Thanks for sharing.

1