Submitted by minimaxir t3_11fbccz in MachineLearning
fmai t1_jalcs0x wrote
Reply to comment by lucidraisin in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
AFAIK, FlashAttention is just a very efficient implementation of attention, so it is still quadratic in the sequence length. Can this be a sustainable solution when context windows grow to hundreds of thousands of tokens?
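For reference, a minimal sketch (plain PyTorch, not the actual fused FlashAttention kernel) of why standard attention is quadratic in sequence length: the score matrix has one entry per query-key pair.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (seq_len, d). The score matrix is (seq_len, seq_len), so both
    # compute and memory grow quadratically with seq_len.
    scores = q @ k.T / (q.shape[-1] ** 0.5)  # O(n^2 * d) FLOPs, O(n^2) memory
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                       # O(n^2 * d) FLOPs

# FlashAttention computes the same result in tiles without ever materializing
# the full n x n matrix, cutting memory to O(n) -- but the FLOP count is
# unchanged, hence still quadratic in sequence length.
```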
lucidraisin t1_jamtx7b wrote
It cannot; the compute still scales quadratically, although the memory bottleneck is now gone. However, I see everyone training at 8k or even 16k context within two years, which is more than plenty for previously inaccessible problems. For context lengths at the next order of magnitude (say, genomics at millions of base pairs), we will have to see whether linear attention (RWKV) pans out, or whether recurrent + memory architectures make a comeback.
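As an illustration of the linear-attention alternative mentioned above, here is a rough sketch in the kernel-feature-map style of Katharopoulos et al. (2020); RWKV uses a different, recurrent formulation, so this is only meant to show how the n x n matrix disappears and compute becomes linear in sequence length.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # Non-causal kernelized linear attention. phi(x) = elu(x) + 1 keeps the
    # feature map positive so the normalizer is well defined.
    phi_q = F.elu(q) + 1                  # (n, d)
    phi_k = F.elu(k) + 1                  # (n, d)
    kv = phi_k.T @ v                      # (d, d) -- O(n * d^2), no n x n matrix
    k_sum = phi_k.sum(dim=0)              # (d,)
    num = phi_q @ kv                      # (n, d) -- O(n * d^2)
    den = phi_q @ k_sum + eps             # (n,)   normalizer
    return num / den.unsqueeze(-1)
```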
LetterRip t1_janljeo wrote
Ah, I'd not seen the Block Recurrent Transformers paper before, interesting.
visarga t1_jalg9iu wrote
I think the main pain point was memory usage.
Dekans t1_jamokhr wrote
> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.
...
> FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
In the paper, the bolded result is achieved with the block-sparse version. The Path-X (16K length) result uses regular FlashAttention.
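For concreteness, a toy version of what block-sparse attention does (illustrative only; the paper's implementation is a fused GPU kernel): each query block attends only to the key blocks allowed by a block mask, so skipped blocks cost nothing.

```python
import torch

def block_sparse_attention(q, k, v, block_mask, block_size=64):
    # q, k, v: (n, d); block_mask: (n//block_size, n//block_size) bool tensor,
    # assumed to keep at least the diagonal block for every query block.
    # Masked-out blocks are skipped entirely, so FLOPs drop roughly in
    # proportion to the sparsity of block_mask.
    n, d = q.shape
    nb = n // block_size
    out = torch.zeros_like(q)
    for i in range(nb):
        qi = q[i * block_size:(i + 1) * block_size]            # (b, d)
        kept = [j for j in range(nb) if block_mask[i, j]]
        kj = torch.cat([k[j * block_size:(j + 1) * block_size] for j in kept])
        vj = torch.cat([v[j * block_size:(j + 1) * block_size] for j in kept])
        scores = qi @ kj.T / d ** 0.5                          # (b, kept * b)
        out[i * block_size:(i + 1) * block_size] = torch.softmax(scores, -1) @ vj
    return out
```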