
lucidraisin t1_jcl0y16 wrote

it is important for everyone to know that there may be a capacity limit on context length, as explored by this paper. gpt4 may not hit this limit, but smaller models like llama may. it also depends on the task you are trying to solve. you cannot just get 'infinite context', whatever some people claim their network can do. more research is needed... hopefully pytorch 2.0 helps get us there
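
for reference, a rough sketch of what the pytorch 2.0 route looks like (shapes and model sizes here are illustrative assumptions, not taken from any particular model):

```python
# illustrative sketch: pytorch 2.0's fused scaled_dot_product_attention
# dispatches to a flash-attention kernel when the inputs allow it
# (fp16/bf16 on gpu, supported head dims), so the full
# (seq_len x seq_len) attention matrix is never materialized and
# longer contexts fit in memory.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 32, 8192, 128  # assumed llama-7b-like sizes

q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (1, 32, 8192, 128)
```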

27

super_deap OP t1_jcl1omd wrote

Thanks for that paper; I came across it a while ago but have not read it yet. Is the limit due to the number of model parameters or the size of the embedding? I suspect the embedding size is the biggest factor limiting how long the context can be.

7

lucidraisin t1_jcl2rkh wrote

yea, the literature on efficient attention is scant and all over the place. in this paper, i believe they claim the limiting factor is the query-key dimension (d_dot), but i think it should depend on the number of heads too. i don't know of any other papers that explore this topic. i just don't want people to be surprised if they fine-tune to longer context lengths and things don't work as well as they do for gpt4
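
to make that concrete (numbers below are assumed, not from the paper): the per-head query-key width is fixed by model width and head count, and it stays fixed no matter how long the context gets

```python
# illustrative only, assumed llama-7b-like numbers: the per-head query-key
# dimension (d_dot) is set by model width / head count, while the number of
# keys each head must discriminate between grows with the context length.
d_model, n_heads = 4096, 32
d_dot = d_model // n_heads            # 128 dims per head

for ctx in (2048, 8192, 32768):
    print(f"context {ctx:>6}: {ctx / d_dot:.0f} keys per query-key dim")
```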

2

super_deap OP t1_jcl3whl wrote

That is understandable. I am working under that assumption as well. (I have failed at too many such experiments to have blind faith 🙈)

2

lucidraisin t1_jcl6ecd wrote

no worries, thanks for running the experiments and sharing your results 🙏

2

antonb90 t1_jczajd1 wrote

Things are improving fast.

>COLT5 is better at any speed. For 16k input length, COLT5 matches or exceeds LONGT5 quality for Large and XL with 35-75% training speedup and 50-100% inference speedup on top of the order-of-magnitude inference speedup from MQA. Encoder speedups are even greater (Appendix D). COLT5-XL also achieves SOTA performance on the SCROLLS benchmark


>COLT5 achieves both stronger performance and faster inference speed at all input lengths and is able to effectively make use of extremely long inputs. We note that COLT5 achieves large quality gains by going from 32k to 64k tokens even while keeping the number of routed tokens constant, providing more evidence for our hypothesis.

Google's new COLT5 scales to 64k input tokens:

https://arxiv.org/abs/2303.09752

1

lucidraisin t1_jczarq8 wrote

that isn't for decoders. it's encoder only, and it still needs to be verified. the majority of research papers never work out on closer examination. just trust me, stick with flash attention for now until further notice and save yourself a lot of headache
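
if it helps, a rough sketch of what i mean by sticking with flash attention (pytorch 2.0 api; tensor shapes are made-up assumptions):

```python
# rough sketch: pin the flash backend in pytorch 2.0 so it errors out rather
# than silently falling back to the quadratic-memory math path.
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)  # assumed shapes
k, v = torch.randn_like(q), torch.randn_like(q)

with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                     enable_math=False,
                                     enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```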

2

Unlucky_Excitement_2 t1_jczk2lm wrote

Since you're the OG with this, can I pick your brain? You don't see value in Hyena Hierarchy? Inference with a 64k context window but 100x more efficient than flash attention. I noticed on GitHub that you plan on implementing flash attention in all your transformer-based models? HH perplexity actually scales with parameter count. Thoughts?

2

lucidraisin t1_jcznnvh wrote

actually, i'm keeping an eye on Hyena! there are, however, a number of issues i still have with the paper (i'm not going to play reviewer 2, as it is not my place, nor is reddit a good forum for that), but i intend to reserve judgement and try it out on a few difficult problems like genomics and EEG later this year. proof is in the pudding.

2

Unlucky_Excitement_2 t1_jczo8wf wrote

Those are actually super compelling problems. I'll keep an eye out. Again thank you, you contribute so much.

2

lucidraisin t1_jczoelv wrote

yea no problem, happy to chat more if you are doing research in this space. you can always reach out to me through email

1