Submitted by super_deap t3_11tmpc5 in MachineLearning
Hi,
I did a quick experiment with PyTorch 2.0's native scaled_dot_product_attention. I was able to run a single forward pass within 9 GB of memory, which is astounding. I think by patching existing pretrained GPT models and adding more positional encodings, one could easily fine-tune those models to 32k attention on a single A100 80GB. Here is the code I used:
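A minimal sketch of the idea, assuming batch size 1, 12 heads, head dim 64, fp16, and a 32k sequence length (illustrative values, not necessarily the exact configuration from my run):

```python
import torch
import torch.nn.functional as F

# Single attention forward pass at a long context with PyTorch 2.0's native
# scaled_dot_product_attention (it dispatches to a memory-efficient/flash
# kernel when one is available).
batch, n_heads, seq_len, head_dim = 1, 12, 32_768, 64

q = torch.randn(batch, n_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

with torch.no_grad():
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```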
I think it should be possible to replicate even GPT-4 with open-source tools: something like BLOOM + FlashAttention, fine-tuned on 32k tokens.
Update: I was successfully able to start training GPT-2 (125M) with a context size of 8k and a batch size of 1 on a 16GB GPU. Since memory scaled linearly from 4k to 8k, I expect 32k would require ~64 GB and should train smoothly on an A100 80GB. Also, I did not do any other optimizations; maybe 8-bit fine-tuning can optimize it further.
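For example, the optimizer states could be kept in 8 bits (a sketch assuming the bitsandbytes library; the model below is just a placeholder, not the patched GPT-2):

```python
import torch.nn as nn
import bitsandbytes as bnb

# Placeholder module standing in for the model being fine-tuned; the point
# here is only the optimizer swap.
model = nn.Linear(768, 768).cuda()

# 8-bit AdamW stores optimizer state in 8 bits instead of fp32, which should
# cut a large chunk of training memory on top of the attention savings.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4, weight_decay=0.1)
```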
Update 2: I basically picked Karpathy's nanoGPT and patched the pretrained GPT-2 by repeating the positional embeddings N times. I was unable to train the model at 8k because generation would cause a crash, so I started training with a context window of 4k on The Pile: one hour in, and the loss seems to be going down pretty fast. Also, Karpathy's generate function is super inefficient (O(n^4), I think), so it took forever to generate even 2k tokens. So I generated 1,100 tokens just to see if the model is able to go beyond the 1k limit, and it seems to be working. Here are some samples at 3k iterations.
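A toy sketch of that embedding patch, using a standalone nn.Embedding as a stand-in for GPT-2's learned positional table (the actual patch is in the training script linked below):

```python
import torch.nn as nn

# GPT-2 small's learned positional table has shape (1024, 768). The patch
# just tiles that table N times so positions 1024..4095 reuse the same
# learned vectors; fine-tuning then adjusts them.
old_wpe = nn.Embedding(1024, 768)   # stands in for model.transformer.wpe
n_repeat = 4                        # 1024 -> 4096 context

new_wpe = nn.Embedding(1024 * n_repeat, 768)
new_wpe.weight.data.copy_(old_wpe.weight.data.repeat(n_repeat, 1))

print(new_wpe.weight.shape)         # torch.Size([4096, 768])
```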
Update 3: I have started the training, and I am publishing the training script in case anyone is interested in replicating or building upon this work. Here is the complete script:
https://gist.github.com/NaxAlpha/1c36eaddd03ed102d24372493264694c
I will post an update after the weekend once the training has progressed somewhat.
Post-Weekend Update: After ~50k iterations (the model has seen ~200 million tokens; I know this is tiny compared to the tens of billions of tokens the giant corporations train on), the loss only dropped from 4.6 to 4.2 on The Pile.
AFAIR, the loss of GPT-2 on The Pile when trained with a 1024-token context is ~2.8. It seems like the embedding dimension per token is limiting how far the loss can go down, since GPT-2 (small) has an embedding dimension of only 768. Maybe someone can experiment with GPT-2 medium etc. to see how much we can improve. This confirms the comment by u/lucidraisin below.
No-Belt7582 t1_jcjqk6s wrote
PyTorch 2.0 keeps impressing every day.