Submitted by minimaxir t3_11fbccz in MachineLearning
LetterRip t1_jaj1kp3 wrote
> I have no idea how OpenAI can make money on this.
Quantizing to mixed int8/int4 - 70% hardware reduction and 3x speed increase compared to float16 with essentially no loss in quality.
If the original cost is A, that's A * 0.3 / 3 = 0.1A, i.e. about 10% of the cost.
Switch from quadratic to memory efficient attention. 10x-20x increase in batch size.
So we're talking about roughly 1% of the resources combined with a 10x price reduction - they should be about 90% more profitable per token than when they introduced GPT-3.
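To make the arithmetic concrete, here's a back-of-envelope sketch in Python. The 70% / 3x / 10-20x factors are the rough estimates above, not measured figures:

```python
# Back-of-envelope only; the factors below are rough estimates, not measurements.
base_cost = 1.0                                 # relative serving cost per token at float16

# Mixed int8/int4 quantization: ~70% less hardware, ~3x faster per token.
quantized_cost = base_cost * 0.3 / 3            # ~0.10 -> about 10% of the cost

# Memory-efficient attention: ~10-20x larger batches on the same hardware.
batch_factor = 10
resource_per_token = quantized_cost / batch_factor   # ~0.01 -> about 1% of resources

# Price per token dropped ~10x, so revenue per token is ~10% of the old revenue
# while resources are ~1% -- hence the claim that margins improve.
print(f"resources per token: {resource_per_token:.1%} of the original")
```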
edit - see MS DeepSpeed MII - showing a 40x per token cost reduction for Bloom-176B vs default implementation
https://github.com/microsoft/DeepSpeed-MII
Also there are additional ways to reduce cost not covered above - pruning, graph optimization, teacher-student distillation. I think teacher-student distillation is extremely likely given reports that the new model has difficulty with more complex prompts.
Thunderbird120 t1_jajok9y wrote
I'm curious which memory efficient transformer variant they've figured out how to leverage at scale. They're obviously using one of them since they're offering models with 32k context but it's not clear which one.
lucidraisin t1_jakb7h4 wrote
flash attention
Thunderbird120 t1_jakbyew wrote
You're better qualified to know than nearly anyone who posts here, but is flash attention really all that's necessary to make that feasible?
lucidraisin t1_jakdtf7 wrote
yes
edit: it was also used to train Llama. there is no reason not to use it at this point, for both training and fine-tuning / inference
fmai t1_jalcs0x wrote
AFAIK, flash attention is just a very efficient implementation of attention, so still quadratic in the sequence length. Can this be a sustainable solution for when context windows go to 100s of thousands?
lucidraisin t1_jamtx7b wrote
it cannot, the compute still scales quadratically although the memory bottleneck is now gone. however, i see everyone training at 8k or even 16k within two years, which is more than plenty for previously inaccessible problems. for context lengths at the next order of magnitude (say genomics at million basepairs), we will have to see if linear attention (rwkv) pans out, or if recurrent + memory architectures make a comeback.
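To make that concrete, here's a toy NumPy sketch of chunked (memory-efficient) attention with an online softmax - my own illustration, not FlashAttention's actual fused kernel. It produces the same output as standard attention and only holds an (n x chunk) block of scores at a time, but the loop still does O(n^2) total work:

```python
import numpy as np

def chunked_attention(q, k, v, chunk=128):
    """Exact softmax attention computed over key/value chunks (toy, single head).

    Memory at any moment is O(n_q * chunk) scores instead of O(n_q * n_k),
    but total compute is still quadratic in sequence length.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[1]))
    running_max = np.full(n_q, -np.inf)   # running row-wise max of the logits
    running_sum = np.zeros(n_q)           # running softmax denominator

    for start in range(0, k.shape[0], chunk):
        k_c, v_c = k[start:start + chunk], v[start:start + chunk]
        logits = (q @ k_c.T) * scale                  # (n_q, chunk) block of scores
        new_max = np.maximum(running_max, logits.max(axis=1))
        correction = np.exp(running_max - new_max)    # rescale old accumulators
        p = np.exp(logits - new_max[:, None])
        running_sum = running_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_c
        running_max = new_max

    return out / running_sum[:, None]

# Sanity check against naive attention (which materializes the full n x n matrix):
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
scores = q @ k.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = weights / weights.sum(axis=1, keepdims=True) @ v
assert np.allclose(chunked_attention(q, k, v), naive)
```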
LetterRip t1_janljeo wrote
Ah, I'd not seen the Block Recurrent Transformers paper before, interesting.
visarga t1_jalg9iu wrote
I think the main pain point was memory usage.
Dekans t1_jamokhr wrote
> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.
...
> FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
In the paper, the Path-256 (64K length) result is achieved with the block-sparse version. The Path-X (16K length) result uses regular FlashAttention.
Hsemar t1_jalp8as wrote
but does flash attention help with auto-regressive generation? My understanding was that it avoids materializing the large attention score matrix during training. At inference (one token at a time) with KV caching this shouldn't be that relevant, right?
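For reference, a minimal sketch of one decoding step with a KV cache (my toy single-head illustration): each new token only forms a single row of attention scores against the cached keys, so the full n x n matrix never gets materialized at generation time.

```python
import numpy as np

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One autoregressive decoding step with a KV cache (toy, single head).

    q_new, k_new, v_new: (1, d) projections of the newly generated token.
    k_cache, v_cache:    (t, d) keys/values of all previously seen tokens.
    Only a (1, t+1) row of attention scores is formed per step, so the
    quadratic-memory problem is mainly a training / prompt-prefill concern.
    """
    k_cache = np.concatenate([k_cache, k_new], axis=0)      # (t+1, d)
    v_cache = np.concatenate([v_cache, v_new], axis=0)
    scores = q_new @ k_cache.T / np.sqrt(q_new.shape[-1])   # (1, t+1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache              # output is (1, d)
```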
minimaxir OP t1_jajcf4s wrote
It's safe to assume that some of those techniques were already used in previous iterations of GPT-3/ChatGPT.
LetterRip t1_jajezib wrote
June 11, 2020 is the date the GPT-3 API was introduced. There was no int4 support, and the Ampere architecture with int8 support had only been introduced weeks prior, so the pricing was set based on float16 hardware.
Memory efficient attention is from a few months ago.
ChatGPT was just introduced a few months ago.
The question was how OpenAI could be making a profit: if they were making a profit at GPT-3's 2020 pricing, then they should be making 90% more profit per token at the new pricing.
jinnyjuice t1_jalkbvu wrote
How do we know these technical improvements result in 90% extra revenue? I feel I'm missing some link here.
andreichiffa t1_jajuk03 wrote
That, and the fact that OpenAI/MS want to completely dominate the LLM market, in the same way Microsoft dominated the OS/browser market in the late 90s/early 2000s.
Smallpaul t1_jam6et8 wrote
They’ll need a stronger story around lock-in if that’s their strategy. One way would be to add structured and unstructured data storage to the APIs.
bjergerk1ng t1_jakszgr wrote
Is it possible that they also switched from the non-chinchilla-optimal davinci to a chinchilla-optimal ChatGPT model? That would be at least 4x smaller.
LetterRip t1_jal4y8i wrote
Certainly that is also a possibility. Or they might have done teacher student distillation.
cv4u t1_jakzhqj wrote
LLMs can be quantized to 8-bit or 4-bit?
LetterRip t1_jal4vgs wrote
Yep, or a mix of the two.
GLM-130B quantized to int4; OPT and BLOOM to int8:
https://arxiv.org/pdf/2210.02414.pdf
Often you'll want to keep the first and last layers in int8 and can do everything else in int4 (see the toy sketch at the end of this comment). You can also quantize based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8-bit for weights and 4-bit for biases (or vice versa?).
Here is a survey on quantization methods; for mixed int8/int4 see Section IV, "Advanced Concepts: Quantization Below 8 Bits":
https://arxiv.org/pdf/2103.13630.pdf
Here is a talk on auto48 (automatic mixed int4/int8 quantization)
https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/
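A toy sketch of that mixed-precision idea (my own illustration, not GLM-130B's or auto48's actual code): a symmetric round-to-nearest quantizer plus a per-layer bit-width rule that keeps the first and last layers at int8 and the rest at int4.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric round-to-nearest quantization of a weight tensor (toy sketch).

    Uses one scale per tensor for simplicity; real systems typically use
    per-channel or per-group scales, and int4 values are packed two per byte.
    """
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def choose_bits(layer_index, num_layers):
    """Keep the (more sensitive) first and last layers at int8, the rest at int4."""
    return 8 if layer_index in (0, num_layers - 1) else 4

# Example: quantize a stack of random "layers" with mixed precision.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((256, 256)).astype(np.float32) for _ in range(6)]
for i, w in enumerate(layers):
    bits = choose_bits(i, len(layers))
    q, scale = quantize_symmetric(w, bits)
    err = np.abs(dequantize(q, scale) - w).mean()
    print(f"layer {i}: int{bits}, mean abs reconstruction error {err:.4f}")
```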
londons_explorer t1_jam6oyr wrote
Aren't biases only a tiny tiny fraction of the total memory usage? Is it even worth trying to quantize them more than weights?
londons_explorer t1_jam6r8g wrote
Don't you mean the other way around?
tomd_96 t1_jamp6kt wrote
Where was this introduced?
CellWithoutCulture t1_javhjpc wrote
I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit
> memory efficient attention. 10x-20x increase in batch size.
That seems large, which paper has that?
LetterRip t1_javpxbv wrote
> I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit
GPT-3 came out in 2020 (they had their initial price, then a modest price drop early on).
FlashAttention is from June 2022.
Quantization is something we've only recently figured out how to do fairly losslessly (especially int4). Tim Dettmers' LLM.int8() is from August 2022.
https://arxiv.org/abs/2208.07339
> That seems large, which paper has that?
See
https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg
>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.
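A crude way to see where that scaling comes from (my own rough arithmetic, not the repo's measured numbers, which also account for masking/dropout buffers and kernel details): standard attention materializes an n x n score matrix per head, while FlashAttention only needs roughly O(n x d) worth of tiles, so the savings grow with sequence length.

```python
# Very rough per-head attention-memory estimate in fp16 (2 bytes per element).
head_dim = 64
for seq_len in (1024, 2048, 4096, 8192):
    standard = seq_len * seq_len * 2          # full n x n score matrix
    flash = 4 * seq_len * head_dim * 2        # ~O(n * d): q, k, v, o tiles
    print(f"n={seq_len}: roughly {standard / flash:.0f}x less attention memory")
```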
CellWithoutCulture t1_javqw9s wrote
Fantastic reply, it's great to see all those concrete advances that made it into prod. Thanks for sharing.