Submitted by super_deap t3_11tmpc5 in MachineLearning
lucidraisin t1_jcl2rkh wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Yeah, the literature is scant and all over the place in the efficient-attention field. In this paper, I believe they claim the relevant factor is the query-key dimension (d_dot), but I think it should depend on the number of heads too. I don't know of any other papers that explore this topic. I just don't want people to be surprised if they fine-tune to greater context lengths and things don't work as well as GPT-4.
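A minimal sketch of the dimension being discussed, assuming a standard multi-head setup: for a fixed model width, the per-head query-key dimension (d_dot) equals embed_dim / n_heads, so it is tied to the head count. The embed_dim, n_heads, and seq_len values below are illustrative assumptions, not numbers from the thread; the call itself is PyTorch 2.0's native scaled_dot_product_attention from the linked post.

```python
import torch
import torch.nn.functional as F

# Illustrative numbers (assumptions): the per-head query-key
# dimension d_dot = embed_dim / n_heads shrinks as heads are added.
embed_dim, n_heads, seq_len, batch = 1024, 16, 4096, 1
head_dim = embed_dim // n_heads  # d_dot = 64 here

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Random tensors standing in for projected queries/keys/values,
# shaped (batch, n_heads, seq_len, head_dim).
q = torch.randn(batch, n_heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch 2.0's native kernel; on CUDA with fp16 inputs it can dispatch
# to FlashAttention, which is what makes long (e.g. 32k) contexts
# fit in memory during fine-tuning.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (1, 16, 4096, 64)
```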
super_deap OP t1_jcl3whl wrote
That is understandable. I am working with that assumption as well. (I have failed too many such experiments to have blind faith 🙈)
lucidraisin t1_jcl6ecd wrote
no worries, thanks for running the experiments and sharing your results 🙏