Submitted by minimaxir t3_11fbccz in MachineLearning
LetterRip t1_jaj1kp3 wrote
> I have no idea how OpenAI can make money on this.
Quantizing to mixed int8/int4 - 70% hardware reduction and 3x speed increase compared to float16 with essentially no loss in quality.
If the original cost is A, that's A * 0.3 / 3 = 0.1A, i.e. about 10% of the cost.
Switch from quadratic to memory efficient attention. 10x-20x increase in batch size.
So we're talking about roughly 1% of the resources combined with a 10x price reduction - they should be about 90% more profitable per token than when they introduced GPT-3.
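To make the arithmetic concrete, here's a back-of-envelope sketch in Python. The 70% / 3x / 10-20x factors are the rough estimates above, not measured figures:

```python
# Back-of-envelope only; the factors below are rough estimates, not measurements.
base_cost = 1.0                                 # relative serving cost per token at float16

# Mixed int8/int4 quantization: ~70% less hardware, ~3x faster per token.
quantized_cost = base_cost * 0.3 / 3            # ~0.10 -> about 10% of the cost

# Memory-efficient attention: ~10-20x larger batches on the same hardware.
batch_factor = 10
resource_per_token = quantized_cost / batch_factor   # ~0.01 -> about 1% of resources

# Price per token dropped ~10x, so revenue per token is ~10% of the old revenue
# while resources are ~1% -- hence the claim that margins improve.
print(f"resources per token: {resource_per_token:.1%} of the original")
```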
edit - see MS DeepSpeed MII - showing a 40x per token cost reduction for Bloom-176B vs default implementation
https://github.com/microsoft/DeepSpeed-MII
Also there are additional ways to reduce cost not covered above - pruning, graph optimization, teacher-student distillation. I think teacher-student distillation is extremely likely given reports that the new model has difficulty with more complex prompts.
Thunderbird120 t1_jajok9y wrote
I'm curious which memory efficient transformer variant they've figured out how to leverage at scale. They're obviously using one of them since they're offering models with 32k context but it's not clear which one.
lucidraisin t1_jakb7h4 wrote
flash attention
Thunderbird120 t1_jakbyew wrote
You're better qualified to know than nearly anyone who posts here, but is flash attention really all that's necessary to make that feasible?
lucidraisin t1_jakdtf7 wrote
yes
edit: it was also used to train Llama. there is no reason not to use it at this point, for both training and fine-tuning / inference
fmai t1_jalcs0x wrote
AFAIK, flash attention is just a very efficient implementation of attention, so still quadratic in the sequence length. Can this be a sustainable solution for when context windows go to 100s of thousands?
lucidraisin t1_jamtx7b wrote
it cannot, the compute still scales quadratically although the memory bottleneck is now gone. however, i see everyone training at 8k or even 16k within two years, which is more than plenty for previously inaccessible problems. for context lengths at the next order of magnitude (say genomics at million basepairs), we will have to see if linear attention (rwkv) pans out, or if recurrent + memory architectures make a comeback.
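To make that concrete, here's a toy NumPy sketch of chunked (memory-efficient) attention with an online softmax - my own illustration, not FlashAttention's actual fused kernel. It produces the same output as standard attention and only holds an (n x chunk) block of scores at a time, but the loop still does O(n^2) total work:

```python
import numpy as np

def chunked_attention(q, k, v, chunk=128):
    """Exact softmax attention computed over key/value chunks (toy, single head).

    Memory at any moment is O(n_q * chunk) scores instead of O(n_q * n_k),
    but total compute is still quadratic in sequence length.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[1]))
    running_max = np.full(n_q, -np.inf)   # running row-wise max of the logits
    running_sum = np.zeros(n_q)           # running softmax denominator

    for start in range(0, k.shape[0], chunk):
        k_c, v_c = k[start:start + chunk], v[start:start + chunk]
        logits = (q @ k_c.T) * scale                  # (n_q, chunk) block of scores
        new_max = np.maximum(running_max, logits.max(axis=1))
        correction = np.exp(running_max - new_max)    # rescale old accumulators
        p = np.exp(logits - new_max[:, None])
        running_sum = running_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_c
        running_max = new_max

    return out / running_sum[:, None]

# Sanity check against naive attention (which materializes the full n x n matrix):
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
scores = q @ k.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = weights / weights.sum(axis=1, keepdims=True) @ v
assert np.allclose(chunked_attention(q, k, v), naive)
```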
LetterRip t1_janljeo wrote
Ah, I'd not seen the Block Recurrent Transformers paper before, interesting.
visarga t1_jalg9iu wrote
I think the main pain point was memory usage.
Dekans t1_jamokhr wrote
> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.
...
> FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
In the paper, the Path-256 (64K length) result is achieved with the block-sparse version. The Path-X (16K length) result uses regular FlashAttention.
Hsemar t1_jalp8as wrote
but does flash attention help with auto-regressive generation? My understanding was that it avoids materializing the large attention score matrix during training. At inference (one token at a time) with KV caching this shouldn't be that relevant, right?
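For reference, a minimal sketch of one decoding step with a KV cache (my toy single-head illustration): each new token only forms a single row of attention scores against the cached keys, so the full n x n matrix never gets materialized at generation time.

```python
import numpy as np

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One autoregressive decoding step with a KV cache (toy, single head).

    q_new, k_new, v_new: (1, d) projections of the newly generated token.
    k_cache, v_cache:    (t, d) keys/values of all previously seen tokens.
    Only a (1, t+1) row of attention scores is formed per step, so the
    quadratic-memory problem is mainly a training / prompt-prefill concern.
    """
    k_cache = np.concatenate([k_cache, k_new], axis=0)      # (t+1, d)
    v_cache = np.concatenate([v_cache, v_new], axis=0)
    scores = q_new @ k_cache.T / np.sqrt(q_new.shape[-1])   # (1, t+1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache              # output is (1, d)
```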
minimaxir OP t1_jajcf4s wrote
It's safe to assume that some of those techniques were already used in previous iterations of GPT-3/ChatGPT.
LetterRip t1_jajezib wrote
June 11, 2020 is the date the GPT-3 API was introduced. There was no int4 support, and the Ampere architecture with int8 support had only been introduced weeks prior, so the pricing was set based on float16 hardware.
Memory efficient attention is from a few months ago.
ChatGPT was just introduced a few months ago.
The question was how OpenAI could be making a profit: if they were making a profit at GPT-3's 2020 pricing, then they should be making 90% more profit per token at the new pricing.
jinnyjuice t1_jalkbvu wrote
How do we know these technical improvements result in 90% extra revenue? I feel I'm missing some link here.
andreichiffa t1_jajuk03 wrote
That, and the fact that OpenAI/MS want to completely dominate the LLM market, in the same way Microsoft dominated the OS/browser market in the late 90s/early 2000s.
Smallpaul t1_jam6et8 wrote
They’ll need a stronger story around lock-in if that’s their strategy. One way would be to add structured and unstructured data storage to the APIs.
bjergerk1ng t1_jakszgr wrote
Is it possible that they also switched from the non-chinchilla-optimal davinci to a chinchilla-optimal ChatGPT model? That would be at least 4x smaller.
LetterRip t1_jal4y8i wrote
Certainly that is also a possibility. Or they might have done teacher student distillation.
cv4u t1_jakzhqj wrote
LLMs can be quantized to 8-bit or 4-bit?
LetterRip t1_jal4vgs wrote
Yep, or a mix of the two.
GLM-130B quantized to int4; OPT and BLOOM to int8:
https://arxiv.org/pdf/2210.02414.pdf
Often you'll want to keep the first and last layers in int8 and can do everything else in int4 (see the toy sketch at the end of this comment). You can also quantize based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8-bit for weights and 4-bit for biases (or vice versa?).
Here is a survey on quantization methods; for mixed int8/int4 see Section IV, "Advanced Concepts: Quantization Below 8 Bits":
https://arxiv.org/pdf/2103.13630.pdf
Here is a talk on auto48 (automatic mixed int4/int8 quantization)
https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/
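A toy sketch of that mixed-precision idea (my own illustration, not GLM-130B's or auto48's actual code): a symmetric round-to-nearest quantizer plus a per-layer bit-width rule that keeps the first and last layers at int8 and the rest at int4.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric round-to-nearest quantization of a weight tensor (toy sketch).

    Uses one scale per tensor for simplicity; real systems typically use
    per-channel or per-group scales, and int4 values are packed two per byte.
    """
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def choose_bits(layer_index, num_layers):
    """Keep the (more sensitive) first and last layers at int8, the rest at int4."""
    return 8 if layer_index in (0, num_layers - 1) else 4

# Example: quantize a stack of random "layers" with mixed precision.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((256, 256)).astype(np.float32) for _ in range(6)]
for i, w in enumerate(layers):
    bits = choose_bits(i, len(layers))
    q, scale = quantize_symmetric(w, bits)
    err = np.abs(dequantize(q, scale) - w).mean()
    print(f"layer {i}: int{bits}, mean abs reconstruction error {err:.4f}")
```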
londons_explorer t1_jam6oyr wrote
Aren't biases only a tiny tiny fraction of the total memory usage? Is it even worth trying to quantize them more than weights?
londons_explorer t1_jam6r8g wrote
Don't you mean the other way around?
tomd_96 t1_jamp6kt wrote
Where was this introduced?
CellWithoutCulture t1_javhjpc wrote
I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit
> memory efficient attention. 10x-20x increase in batch size.
That seems large, which paper has that?
LetterRip t1_javpxbv wrote
> I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit
GPT-3 came out in 2020 (they had their initial price, then a modest price drop early on).
FlashAttention is from June 2022.
Quantization is something we've only recently figured out how to do fairly losslessly (especially int4). Tim Dettmers' LLM.int8() is from August 2022.
https://arxiv.org/abs/2208.07339
> That seems large, which paper has that?
See
https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg
>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.
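A crude way to see where that scaling comes from (my own rough arithmetic, not the repo's measured numbers, which also account for masking/dropout buffers and kernel details): standard attention materializes an n x n score matrix per head, while FlashAttention only needs roughly O(n x d) worth of tiles, so the savings grow with sequence length.

```python
# Very rough per-head attention-memory estimate in fp16 (2 bytes per element).
head_dim = 64
for seq_len in (1024, 2048, 4096, 8192):
    standard = seq_len * seq_len * 2          # full n x n score matrix
    flash = 4 * seq_len * head_dim * 2        # ~O(n * d): q, k, v, o tiles
    print(f"n={seq_len}: roughly {standard / flash:.0f}x less attention memory")
```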
CellWithoutCulture t1_javqw9s wrote
Fantastic reply, it's great to see all those concrete advances that made it into prod. Thanks for sharing.