LetterRip t1_javpxbv wrote
Reply to comment by CellWithoutCulture in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit
GPT-3 came out in 2020 (it launched at its initial price, followed by a modest price drop early on).
FlashAttention is from June 2022.
Quantization is something we've only recently figured out how to do fairly losslessly (especially int4). Tim Dettmers' LLM.int8() is from August 2022.
https://arxiv.org/abs/2208.07339
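For anyone who wants to try it, here's a minimal sketch of LLM.int8()-style 8-bit inference via the Hugging Face `transformers` + `bitsandbytes` integration (the model name is just an example, not something from the thread, and this assumes `accelerate` and `bitsandbytes` are installed):

```python
# Minimal sketch: load a causal LM with LLM.int8() quantization.
# Assumes: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # example model, swap in your own

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place layers on available GPUs
    load_in_8bit=True,   # bitsandbytes LLM.int8() quantization
)

inputs = tokenizer("The ChatGPT API got cheaper because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

Weights are stored in int8 (roughly halving memory vs fp16), with outlier features handled in higher precision, which is what keeps the quality loss small.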
> That seems large, which paper has that?
See
https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg
>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.
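To put rough numbers on the quadratic-vs-linear point: standard attention materializes an L x L score matrix per head, while FlashAttention computes the same result blockwise and never stores it. A quick back-of-envelope (head count and dtype are illustrative, not figures from the paper):

```python
# Memory for the L x L attention score matrices that standard attention
# materializes per layer (FlashAttention avoids storing these entirely).
def attn_scores_bytes(seq_len, n_heads=16, bytes_per_elem=2):  # fp16
    return seq_len * seq_len * n_heads * bytes_per_elem

for L in (1024, 2048, 4096):
    gib = attn_scores_bytes(L) / 2**30
    print(f"seq_len={L}: ~{gib:.3f} GiB per layer for attention scores")

# seq_len=1024: ~0.031 GiB, seq_len=2048: ~0.125 GiB, seq_len=4096: ~0.500 GiB
# -- doubling the sequence length quadruples this term, which is why the
# savings from FlashAttention grow with sequence length.
```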
CellWithoutCulture t1_javqw9s wrote
Fantastic reply, it's great to see all those concrete advances that made it into prod. Thanks for sharing.